Primary key question - database

Is there a benefit to having a single column primary key vs a composite primary key?
I have a table that consists of two id columns which together make up the primary key.
Are there any disadvantages to this? Is there a compelling reason for me to throw in a third column that would be unique on its own?

Database Normalization nuts will tell you one thing.
I'm just going to offer my own opinion, based on what I've learned over the years. I stick an auto-incrementing ID field on every ($&(##$)# one of my tables. It makes life a million times easier in the long run to be able to single out a single row with impunity.
This is from a "down in the trenches" developer.

Single column keys are simple to write, simple to maintain, and simple to understand.
If you're going to have a huge number of rows - billions? - maybe saving a byte here and there will help.
But if you're not looking at extreme cases, optimizing for "simple" is often the best way to go.

If you are a coder and the database is nothing to you but a glorified object-store, then sure, by all means inject surrogate keys willy nilly. In fact go one better and just delegate all DB schema design and DB interaction to your favourite ORM and be done with it. Indeed, when I want a small or medium scale object-store, that's exactly what I do.
If you are approaching an information systems or information management problem, then it is a completely different story. When you start dealing with tens (or more likely hundreds) of millions of dirty records integrated from multiple sources, several or all of which are not under your control, the seductive lure of an easy answer to the problems of 'identity' is a trap.
Yes, you sometimes still introduce a surrogate key internally to allow for concise FK relationships and improved cache efficiency on covering indices, but you gain those benefits at the cost of substantial pain in managing the natural-key/surrogate-key relationship.
In this case it is important to make sure you don't allow the surrogate key to leak. Your public APIs at the business-logic layer should use the natural key; nothing above a document/record cache should be aware of the existence of a surrogate key. Be aware that the cost of matching updates against the existing surrogate keys can be prohibitive, and a far larger scalability hit than the incremental cost of moving a few extra bytes per request over the internal network.
So in conclusion:
If the DB is just being used as an object-store: let the ORM worry about object identity, and there should almost certainly be a surrogate key.
If the DB is being used as a database: the introduction of a surrogate key is an engineering design decision with serious tradeoffs in both directions. The decision will need to be made on a case by case basis, with full recognition of the resulting costs to be accepted in exchange for the benefits gained either way.
Update
The 'convenience' of a surrogate key is really just the ability to punt on the question of identity. This is often necessary in a database, and reasonable in the caching layer as I allow, but beyond that it leads to brittle data designs. The problem is that identity is not something that has one correct answer. For non-trivial data-intensive systems you will routinely find yourself needing to work in terms of equivalence classes, rather than the reference identity that object-oriented programming lulls us into thinking is normal.
What it really comes down to is a realization that the whole concept of a 'primary key' is a fiction invented to help the relational model work efficiently; adopting a surrogate key cements that fiction and makes the whole system brittle and inflexible. Business logic needs to be able to provide its own definitions of equality: sometimes four copies of the same file need to be considered four files, sometimes they should be considered indistinguishable from the original file. When you edit one of them, is that then a new file? The same file? The answer to both questions is of course yes, when... Working with natural keys provides this critical ability to work in terms of conceptual equivalence classes. If you let surrogate keys infect your business logic, you quickly lose this.

I have had to use multi-column primary keys in the past, and it became quite a nightmare very quickly.
If you have one table that references your first table, how does it contain that primary key? Now add another table that references only the second table but needs to find data in the first. Now another... on down the rabbit hole.
If you know that you will only have the one table, there's probably not an issue either way; use whichever represents your data better. But if you'll be using it in joins, you can lose performance pretty quickly.

Is there a benefit to having a single column primary key vs a composit[sic] primary key?
Yes. If the primary key also happens to be the clustered index, it is common that the clustered index is duplicated fully for each secondary index in the table. Therefore, having a fatter clustered index, which is what one would get with a composite, implies an increase in storage cost. Also, foreign references to this table would need to specify both fields to refer to a unique entry, which implies a further storage cost. There is also an arguably greater cost in development time because there is a slight increase in the complexity of the join.
On the other hand, depending on the distribution of the values of your two key fields, it may be the case that concurrent access to your table is greatly improved because chronologically-successive inserts could occur on different physical pages; this could be the case, for example, if your fields are time-independent (and non-monotonic like an auto-incrementer) like clientID, or something like that. This could be significant for performance in a high concurrency environment.
I have a table that consists of two id columns which together make up the primary key.
Are there any disadvantages to this? Is there a compelling reason for me
to throw in a third column that would be unique on its own?
If the most common way in which your table is queried is to specify those three fields as restrictions, then having all three in a composite key would likely be the fastest lookup.
And there is another important point that I almost forgot. Since having a composite key means that foreign references to this table from other tables must specify all fields in the key, it also means that some queries performed on the other table that required a restriction on one or more of the parts of the composite index of this table, can be performed without requiring a join. This could be considered similar to the concept of denormalization for the sake of performance (and arguably sacrificing a little ease of maintainability).
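To make that last point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders/order_items tables and their columns are made up for illustration; they are not from the question. Because the child table has to carry every column of the parent's composite key, it can be filtered by client_id without touching the parent table at all.

```python
import sqlite3

# Hypothetical schema for illustration: orders is keyed by (client_id, order_no).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE orders (
        client_id INTEGER NOT NULL,
        order_no  INTEGER NOT NULL,
        PRIMARY KEY (client_id, order_no)
    );
    CREATE TABLE order_items (
        client_id INTEGER NOT NULL,
        order_no  INTEGER NOT NULL,
        line_no   INTEGER NOT NULL,
        product   TEXT NOT NULL,
        PRIMARY KEY (client_id, order_no, line_no),
        -- The foreign key has to name every column of the parent key.
        FOREIGN KEY (client_id, order_no) REFERENCES orders (client_id, order_no)
    );
    INSERT INTO orders VALUES (7, 1), (7, 2), (9, 1);
    INSERT INTO order_items VALUES
        (7, 1, 1, 'bolt'), (7, 2, 1, 'nut'), (9, 1, 1, 'washer');
""")

# client_id already lives in the child table, so no join to orders is needed.
items_for_client = conn.execute(
    "SELECT product FROM order_items WHERE client_id = ? ORDER BY order_no, line_no",
    (7,),
).fetchall()
print(items_for_client)
```

With a single-column surrogate key on orders, the same query would need a join to find out which client each item belongs to. The price is the wider foreign key in every child row.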

In general I prefer to have a surrogate key because there are very few truly good natural keys (the key problem is not uniqueness but that they change over time), and the longer the natural key, the more it affects performance when used as a PK. If you have a natural key, you should create a unique index on it and then use the surrogate key as the PK used for joining to other tables. That enforces the uniqueness of the natural key data but fixes the problems of join performance and the extra time to update all child records when the natural key changes.
There is one case where I ignore this, and that is a joining table. If it is a table that is used to enforce a many-to-many relationship and consists only of two surrogate keys from other tables, then you really gain nothing from adding a surrogate key. Typically the individual keys are used for joins, not the PK, and surrogate keys almost never change. In a joining table, I just add the two columns I need and nothing else.
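A rough sketch of such a joining table, again with Python's sqlite3 and invented table names (students, courses, enrollments). The composite primary key on the two foreign keys is all the table needs, and it already stops the same pair from being linked twice.

```python
import sqlite3

# Hypothetical many-to-many example; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students (student_id),
        course_id  INTEGER NOT NULL REFERENCES courses (course_id),
        -- No extra id column: the pair of surrogate keys is the key.
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'Ann');
    INSERT INTO courses  VALUES (10, 'Databases');
    INSERT INTO enrollments VALUES (1, 10);
""")

# Linking the same student to the same course twice violates the composite key.
try:
    conn.execute("INSERT INTO enrollments VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```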

In most databases I know (MySQL, PostgreSQL), declaring a composite key generates an index on it. So if you specify your key as composite, the DB should give you an efficient way to look up tuples using that key. I think this is the case for all DBs, so I don't think you have to worry about lookup performance there.

Don't use multi-column keys. They get very difficult to maintain, especially if the components of the key are not human-understandable.
Use an internally generated key instead.

Imagine you have a composite primary key (field1 and field2, for example) instead of a single auto-incrementing identifier. Clients' requirements change a lot, and after some development the client says that field2 is no longer compulsory and can be null, so it can no longer be part of the primary key. Imagine this table is one of the most important in your model. Then every foreign key that references it has to change, because field2 cannot stay in the composite primary key. Changing the primary key all over the model is a nightmare.
Also, if there are a lot of foreign keys, I don't think it is a good idea to add several key columns to each table just to make the link.

I'm not sure there's enough information for us to make your call for you. Here are a few observations that might be helpful though.
Is the primary key a clustered index? Is the table referenced by other tables through a foreign key? If yes, then you may benefit from a single-column key, because that key will appear in those other tables. This is how you would save space.
If the table is not referenced by other tables, then you would be using extra space in your table without much additional benefit. And, if this table only contains the two columns now, then you would increase the table size by 50%.
If you use an extra column for the primary key, do not forget your natural key (the two-column key). Create a unique constraint on the composite key. You still want to maintain the integrity of the real data.

The decision should always be based on requirements and the intended meaning of the data. A table with only a single-attribute key clearly enforces a different kind of constraint, and implies a very different meaning, from the same table with a multi-attribute key. On the other hand, adding an additional unique column would be a waste of resources and add meaningless complexity if you don't actually need to use it anywhere.

One caveat to the auto-incrementing column is that it can give a false impression of uniqueness. Sure, your identity column is always unique, but that's just a meaningless value you've attached to the table. Unless you also have a unique constraint attached to the set of columns that represent the actual semantic primary key of the table, you have no guarantee of meaningful uniqueness.
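Here is a small sketch of that caveat using Python's sqlite3 module. The table names (loose, strict) and the a_id/b_id columns are placeholders standing in for the two id columns from the question. Both tables have an auto-incrementing id, but only the one with a unique constraint on the real key refuses the duplicate pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate key only: nothing protects the (a_id, b_id) pair.
    CREATE TABLE loose (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL
    );
    -- Surrogate key plus a unique constraint on the semantic key.
    CREATE TABLE strict (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL,
        UNIQUE (a_id, b_id)
    );
""")

# The same pair goes in twice without complaint; each row gets its own id.
conn.execute("INSERT INTO loose (a_id, b_id) VALUES (1, 2)")
conn.execute("INSERT INTO loose (a_id, b_id) VALUES (1, 2)")
loose_rows = conn.execute("SELECT COUNT(*) FROM loose").fetchone()[0]

# With the unique constraint the second insert is rejected.
conn.execute("INSERT INTO strict (a_id, b_id) VALUES (1, 2)")
try:
    conn.execute("INSERT INTO strict (a_id, b_id) VALUES (1, 2)")
    strict_rejected = False
except sqlite3.IntegrityError:
    strict_rejected = True

print(loose_rows, strict_rejected)
```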

Related

Creating a SQL database without defining primary key

So in my work environment we don't use a 'primary key' as defined by SQL Server. In other words, we don't right click a column and select "set as primary key".
We do however still have primary keys, we just use a unique ID column. In stored procedures we use these to access the data like you would in any relational database.
My question is: other than the built-in functionality that comes with defining a primary key in SQL Server (Entity Framework support and so on), is there a good reason to use the 'primary key' functionality over just using a unique ID column and accessing your tables with that in your own stored procedures?
The biggest drawback I see (again other than being able to use Entity Framework and things like that) is that you have to mentally keep track or otherwise keep track of what ID relates to what tables.
There is nothing "special" about the PRIMARY KEY constraint. It's just a uniqueness constraint and you can achieve the same results by using the UNIQUE NOT NULL syntax to define your keys instead.
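As a rough illustration of that claim, the sketch below declares a key with UNIQUE NOT NULL and no PRIMARY KEY at all. It uses SQLite through Python's sqlite3 module as a stand-in, which is an assumption on my part since the question is about SQL Server; engine-specific details such as index defaults are not shown. The customers table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY anywhere: the key is declared as UNIQUE NOT NULL instead.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER UNIQUE NOT NULL, name TEXT)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ann')")


def rejected(sql):
    """Return True if the statement is refused with an integrity error."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Both guarantees a primary key would give are enforced here as well.
duplicate_rejected = rejected("INSERT INTO customers VALUES (1, 'Bob')")
null_rejected = rejected("INSERT INTO customers VALUES (NULL, 'Cy')")
print(duplicate_rejected, null_rejected)
```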
However, uniqueness constraints (i.e. keys in general, not "primary" keys specifically) are very important for data integrity reasons. They ensure that your data is unique which means that sensible, meaningful results can be derived from your data. It's extremely difficult to get accurate results from a database that contains duplicate data. Also, uniqueness constraints are required to enforce referential integrity between tables, which is another very important aspect of data integrity. Poor data integrity is a data management problem that costs businesses billions of dollars every year and that's the bottom line of why keys are important.
There is a further reason why unique indexes are important: query optimization and performance. Unique indexes improve query performance. If your data is supposed to be unique, then creating a unique index on it will give the query optimizer the best chance of picking a good execution plan for your queries.
I think the drawback is not using the primary key at all and using a unique key constraint for something it wasn't intended to do.
Unique keys: You can have many of them. They are meant to offer a way to determine uniqueness among rows.
Primary key: like the Highlander, there can only be one. Its intended use is to identify the rows of the table.
I can't think of any good reason not to use a primary key. My opinion is that without a primary key, your table isn't actually a table. It's just a lump of data.
Follow-up: If you don't believe me, check out this guy who asked a bunch of DBAs if it was OK not to use a primary key.
Is it OK not to use a Primary Key When I don't Need one
There are philosophical and practical answers to your question.
The practical answer is that using the primary key constraint enforces "not null", and "unique". This protects you from application-level bugs.
The philosophical answer is that you want developers to operate at the highest possible level of abstraction, so that they don't have to stuff their brain full of detail when trying to solve problems.
Primary and foreign keys are abstractions that allow us to make assumptions about the underlying data model. We can think in terms of (business) entities, and their relationships.
In your workplace, you're forcing developers to think in terms of tables and indexes and conventions. You no longer think about "customers" and "orders" and "line items", but about software artefacts that represent those business entities, and the "we always represent uniqueness by a combination of a GUID and unique index" rule. That mental model is already complicated enough in most applications; you're just making it harder for yourselves, especially when bringing new developers into the team.

What are the benefits of finding a natural primary key

My question is more or less the opposite of this one: why would one ever want to bother finding a natural primary key in a relation when using a sequence as a surrogate seems so much easier?
BradC mentioned in his answer to a related question that the criteria for choosing a primary key are uniqueness, irreducibility, simplicity, stability and familiarity. It looks to me like using a sequence sacrifices the last criterion in order to provide an optimal solution for the first four.
If I hold those criteria to be correct, I can reformulate my question as: in which circumstances would one ever consider it advantageous to complicate one's life by looking for a unique, irreducible, simple and stable key that is also familiar?
To get a meaningful value from a lookup table without doing unnecessary joins.
Example case: garments references a lookup table of colors, which has an auto-increment primary key. Getting the name of the color requires a join:
SELECT c.color
FROM garments g
JOIN colors c USING (color_id);
Simpler alternative: colors.color itself is the primary key of that table, and therefore it is also the foreign key column in any table that references it. No join is needed:
SELECT g.color
FROM garments g;
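For anyone who wants to try both variants side by side, here is a minimal runnable sketch with Python's sqlite3 module. The _s and _n table suffixes and the sample rows are mine, added only so the surrogate and natural-key versions can live in one database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Variant 1: the lookup table has a surrogate key.
    CREATE TABLE colors_s (
        color_id INTEGER PRIMARY KEY,
        color    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE garments_s (
        name     TEXT,
        color_id INTEGER REFERENCES colors_s (color_id)
    );
    INSERT INTO colors_s VALUES (1, 'red');
    INSERT INTO garments_s VALUES ('scarf', 1);

    -- Variant 2: the color name itself is the key.
    CREATE TABLE colors_n (color TEXT PRIMARY KEY NOT NULL);
    CREATE TABLE garments_n (
        name  TEXT,
        color TEXT REFERENCES colors_n (color)
    );
    INSERT INTO colors_n VALUES ('red');
    INSERT INTO garments_n VALUES ('scarf', 'red');
""")

# Surrogate key: the readable value is only reachable through a join.
with_join = conn.execute(
    "SELECT c.color FROM garments_s g JOIN colors_s c USING (color_id)"
).fetchall()

# Natural key: the readable value is already in the referencing table.
without_join = conn.execute("SELECT g.color FROM garments_n g").fetchall()

print(with_join, without_join)
```

Both queries return the same rows; the difference is only in how much work the database has to do to get them.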
The answer is data integrity. Instances of entities in the business domain outside the database are by definition identifiable things. If you fail to give them external, real world identifiers in the database then that database stands little chance of modelling reality correctly.
A natural key[1] is what ensures facts in the database are identifiable with actual things in the reality you are trying to model. They are the means which users rely on when they act on and update the data in the database. The constraints that enforce those keys are an implementation of business rules. If your database is to model the business domain accurately then natural keys are not just desirable but essential. If you doubt that then you haven't done enough business analysis. Just ask your customers how they think their business would operate if they were left looking at screens full of duplicate data!
[1] I recommend calling them business keys or domain keys rather than natural keys. Those are far more appropriate and less overloaded terms even though they mean exactly the same thing.
You generally need to identify what the unique key on the data is anyway, as you still need to be able to ensure that the data is not duplicated.
The strength of the synthetic key is that it allows the values of the unique natural key to be modifiable in future, with child records not needing to be updated.
So you're not really skipping the "identify the key" part of the design by using a synthetic primary key, you're just insulating yourself from the possibility of the values changing.
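A small sketch of that insulation, using Python's sqlite3 module with made-up products and order_lines tables. The natural key (product_code) still has its unique constraint, but child rows point at the synthetic product_id, so renaming the code touches exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE products (
        product_id   INTEGER PRIMARY KEY,   -- synthetic key
        product_code TEXT NOT NULL UNIQUE   -- natural key, still enforced
    );
    CREATE TABLE order_lines (
        line_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES products (product_id)
    );
    INSERT INTO products VALUES (1, 'OLD-001');
    INSERT INTO order_lines VALUES (100, 1);
""")

# The business renames the code: one row changes, no child rows are updated.
conn.execute("UPDATE products SET product_code = 'NEW-001' WHERE product_id = 1")

codes = conn.execute(
    "SELECT p.product_code FROM order_lines l JOIN products p USING (product_id)"
).fetchall()
print(codes)
```

Had order_lines referenced product_code directly, the rename would have needed a cascading update across every child row.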
Below are the benefits of using a natural primary key:
If you need a unique constraint on a column anyway, making it the primary key fulfils that need, as long as the column is never supposed to receive null values. So it saves you the cost of one extra key.
In some RDBMSs, declaring a primary key automatically creates a B-tree index on that column. If you choose a natural primary key that matches your access pattern, that is the icing on the cake, because you kill two birds with one stone: you save the cost of an extra index and make your queries faster by having that meaningful primary key in the WHERE clause.
Last but not least, you save the space of one extra column/key/index.

Good Database-Design has no identity-columns in the tables, right?

What is the benefit of having an additional identity column in each table of a database? What are the drawbacks?
Update:
Now I want to expand the case and introduce replication. What is this surrogate key (identity column) for, aside from the rowguid we get with replication? Going by K. Brian Kelley's objection, one should put the clustered index on this rowguid (and forget the identity column). What do you think?
Short version: Surrogate or synthetic key (what you probably mean by "identity column") versus natural key is a very old debate.
Pros of surrogate keys:
makes you independent of changes in the natural key (think new requirements / changing domain model), which would otherwise cascade through your data model
sometimes, there is no natural key (e.g. persons in an address book)
is often faster, because it is shorter (single int column, instead of e.g. several varchars)
is more convenient in joins etc. because it's only a single column
Cons:
you still need a unique index for every candidate natural key (so one more index needed)
it is foreign to the domain, and may require an additional join to get the real data
Generally, the agreement is that surrogate keys are usually a good idea, except in simple cases like join tables.
For all the details, see Wikipedia, which has a good article on the topic.
Often the use of surrogate keys, of which IDENTITY columns are the most common, is done for performance reasons. You should still identify the natural keys (the columns which make the row unique based on the data).
Typically surrogate keys are integer values. That makes them easy to link together using joins with other tables (and restrict accordingly using foreign keys). Also, when talking about SQL Server, all nonclustered indexes depend on the clustered index. So if the clustered index is based on the natural key and is large in size as a result, then all the nonclustered indexes are going to be large as well, because they will refer back to the clustered index. Therefore, a lot of folks build the primary key around that integer-based surrogate key. I know I'm simplifying it a bit, but that's a key reason for the use of surrogate keys.
The drawback is that the surrogate key is effectively meaningless. If someone were to change the value of the key, you could break a relationship if the foreign key constraints are not present or are disabled. In the case of making a change which alters the alternate key, you're actually changing the data itself. So you would expect such a break if the entities are built properly and you would be changing the data in related tables as well.

Picking the best primary key + numbering system

We are trying to come up with a numbering system for the asset system that we are creating. There have been a few heated discussions on this topic in the office, so I decided to ask the experts of SO.
Considering the database design below what would be the better option.
Example 1: Using auto surrogate keys.
=================   ==================
Road_Number(PK)     Segment_Number(PK)
=================   ==================
1                   1
Example 2: Using program generated PK
=================   ==================
Road_Number(PK)     Segment_Number(PK)
=================   ==================
"RD00000001WCK"     "00000001.1"
(the 00000001.1 means it's the first segment of the road. This increases every time you add a new segment, e.g. 00000001.2)
Example 3: Using a bit of both(adding a new column)
=======================     ==========================
ID(PK)  Road_Number(UK)     ID(PK)  Segment_Number(UK)
=======================     ==========================
1       "RD00000001WCK"     1       "00000001.1"
Just a bit of background information, we will be using the Road Number and Segment Number in reports and other documents, so they have to be unique.
I have always liked keeping things simple so I prefer example 1, but I have been reading that you should not expose your primary keys in reports/documents. So now I'm thinking more along the lines of example 3.
I am also leaning towards example 3 because if we decide to change how our asset numbering is generated it won't have to do cascade updates on a primary key.
What do you think we should do?
Thanks.
EDIT: Thanks everyone for the great answers, it has helped me a lot.
This is really a discussion about surrogate (also called technical or synthetic) vs natural primary keys, a subject that has been extensively covered. I covered this in Database Development Mistakes Made by AppDevelopers.
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Which format of primary key would you use in this situation.
Surrogate Vs. Natural/Business Keys
Should I have a dedicated primary key field?
Auto number fields are the way to go. If your keys have meaning outside your database (like asset numbers) those will quite possibly change and changing keys is problematic. Just use indexes for those things into the relevant tables.
I would personally say keep it simple and stay with an autoincremented primary key. If you need something more "Readable" in terms of display in the program, then possibly one of your other ideas, but I think that is just adding unneeded complexity to the primary key field.
I'm also very strongly in the "don't use primary keys as meaningful data" camp. Every time I have contravened that policy it has ended in tears. Sooner or later the meaningful data needs to change and if that means you have to change a primary key it can get painful. The primary key will probably be used in foreign key constraints and you can spend ages trying to sort it all out just to make a simple data change.
I always use GUIDs/UUIDs for my primary keys in every table I ever create, but that's just personal preference; serials or the like are also good.
Don't put meaning into your PK fields unless...
It is 100% certain that the value will never change, and that
no two people would ever reasonably argue about which value should be used for a particular row.
Go with option one and format the value in the app to look like option two or three when it is displayed.
I think the important thing to remember here is that each table in your database/design might have multiple keys. These are the Candidate Keys.
See wikipedia entry for Candidate Keys
By definition, all Candidate Keys are created equal. They are each unique identifiers for the table in question.
Your job then is to select the best candidate from the pool of Candidate Keys to serve as the Primary Key. The Primary Key will be used by other tables to establish the relational constraints, but you are free to continue using Candidate Keys to query the table.
Because Primary Keys are referenced by other structures, and therefore used in join operations, the criteria for Primary Key selection boils down to the following for me (in order of importance):
Immutable/Stable - Primary Key values should not change. If they do, you run the risk of introducing update anomalies
Not Null - most DBMS platforms require that the Primary Key attribute(s) are not null
Simple - simple datatypes and values for physical storage and performance. Integer values work well here, and this is the datatype of choice for most surrogate/auto-gen keys
Once you've identified the Candidate Keys, the criteria above can be used to select the Primary Key. If there is no "Natural" Candidate Key that meets the criteria, then a Surrogate Key that does meet the criteria can be created and used as mentioned in other answers.
Follow the Don't Use policy.
Some problems you can run into:
You need to generate keys from more than one host.
Someone will want to reserve contiguous numbers to use together.
How meaningful will people want it to be? Wars are fought over this, and you're in the first skirmish of one already. "It's already meaningful, and if we just add two more digits we can ..." i.e. you're establishing a design style that will (should) be extensible.
If you are concatenating the two, you're doing typecasts which can mess up your query Optimizer.
You'll need to reclassify roads, and redefine their boundaries (i.e. move the roads), which implies changing the primary key and maybe losing links.
There are workarounds for all this, but this is the kind of issue where workarounds proliferate and get out of control. And it doesn't take more than a couple to get beyond "Simple".
As mentioned before, keep your internal primary keys as just keys, whatever the most optimal datatype is on your platform.
However you do need to let the numbering system argument be fought out, as this is actually a business requirement, and perhaps let's call it an identification system for the asset.
If there is only going to be one identifier, then add it as a column to the main table. If there are likely to be many identification systems (and assets usually have many), you'll need two more tables
Identifier-type table        Identifier-cross-ref table
  type-id  ------------>       type-id            (unique
  type-name                    identifier-string   key)
                               internal-id
That way different people who need to access the asset can identify in their own way. For example the server team will identify a server differently from the network team and different again from project management, accounts, etc.
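A minimal sketch of those two tables in Python's sqlite3, under the assumption that internal-id points at a main assets table with a plain integer key. The table names, the sample identifier types and strings, and the find_asset helper are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE assets (
        internal_id INTEGER PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE identifier_types (
        type_id   INTEGER PRIMARY KEY,
        type_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE identifier_xref (
        type_id           INTEGER NOT NULL REFERENCES identifier_types (type_id),
        identifier_string TEXT    NOT NULL,
        internal_id       INTEGER NOT NULL REFERENCES assets (internal_id),
        -- An identifier is unique within its own identification system.
        UNIQUE (type_id, identifier_string)
    );
    INSERT INTO assets VALUES (1, 'rack server');
    INSERT INTO identifier_types VALUES (1, 'server-team'), (2, 'accounts');
    INSERT INTO identifier_xref VALUES (1, 'srv-web-01', 1), (2, 'FA-2231', 1);
""")


def find_asset(type_name, identifier):
    """Resolve a team-specific identifier to the internal asset id, or None."""
    row = conn.execute(
        """
        SELECT x.internal_id
        FROM identifier_xref x
        JOIN identifier_types t USING (type_id)
        WHERE t.type_name = ? AND x.identifier_string = ?
        """,
        (type_name, identifier),
    ).fetchone()
    return row[0] if row else None


print(find_asset("server-team", "srv-web-01"), find_asset("accounts", "FA-2231"))
```

Each team looks the asset up by its own name for it, and both lookups land on the same internal id.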
Plus, you get to go to all the meetings where everyone argues with each other.
Another thing to keep in mind is that if you're importing a lot of data into this system, you may find out that things like Road_Number are not as unique as you thought, and there may be operational roadblocks to fixing the problem (repainting road signs, etc.).
While natural keys may have great meaning to the business users, if you do not have the agreement that those keys are sacred and should not be altered, you will more than likely be pulling your hair out while maintaining a database where the "product codes have to be changed to accommodate the new product line the company acquired." You need to protect the RI of your data, and integers as primary keys with auto-increment are the best way to go. Performance is also better when indexing and traversing integers than char columns.
While not appropriate as primary keys, natural keys are very appropriate for user consumption, and you can enforce uniqueness via an index. They bring a context to the data that will make it easier for all parties to understand. Also, in the event that you need to reload data, the natural keys can help verify that your lookups are still valid.
I would go with the surrogate key, but you may want to have a computed column that "formats" the surrogate key into a more "readable" value if that improves your reporting. The computed column could, for instance, produce example 2 from the surrogate key for display purposes.
I think the surrogate key route is the way to go and the only exceptions that I make for it are join tables, where the primary key could be composed of the foreign key references. Even in these cases I'm finding that having a surrogate primary key is more useful than not.
I suspect that you really should use option #3, as many here have already said. Surrogate PKs (either Integers or GUIDs) are good practice, even if there are adequate business keys. Surrogates will reduce maintenance headaches (as you yourself have already noted).
That being said, something you may want to consider is whether or not your database is:
focused on data maintenance and transactional processing (i.e. Create/Update/Delete operations)
geared towards analysis and reporting (i.e. Queries)
In other words, are the users concerned with maintaining active data or querying largely static data to find answers?
If you are heavily focused on building an analysis and reporting DB (e.g. a data warehouse/mart) that is exposed to technical business users (e.g. report designers) who have a good grasp of the business vocabulary, then you might want to consider using natural keys based on meaningful business values. They help reduce query complexity by eliminating the need for complex joins and help the user focus on their task, not fighting the database structure.
Otherwise you're probably focused on a full CRUD DB that has to cover all the bases to some degree - this is the vast majority of situations. In which case, go with your option #3. You can always optimize for queryability in the future but you'll be hard pressed to retrofit for maintainability.
I hope you will agree with me that every design element should have single purpose.
The question is: what do you think the purpose of a PK is? If it is to identify a unique record in a table, then surrogate keys win without much trouble. This is simple and straightforward.
As far as the new columns in option 3 are concerned, you should check whether they can be calculated from other elements without too much of a performance penalty (best would be to do the calculation in the model layer, where it can be changed more easily than if it is done in the RDBMS). For example, you can store the segment number and road number in the corresponding tables and then use them to generate "00000001.1". This will allow you to change the asset numbering on the fly.
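As a sketch of what that model-layer calculation might look like, here are two small Python helpers. The function names are mine, and I am assuming the display format is an 8-digit zero-padded number with a fixed "RD" prefix and "WCK" suffix, since the question only shows one sample value and does not say what the suffix means.

```python
def format_road_number(road_id: int, suffix: str = "WCK") -> str:
    """Render a surrogate road id in the display style of example 2.

    The fixed "RD" prefix and the default suffix are assumptions based on
    the single sample value in the question.
    """
    return f"RD{road_id:08d}{suffix}"


def format_segment_number(road_id: int, segment_no: int) -> str:
    """Render a segment as <zero-padded road id>.<segment number>."""
    return f"{road_id:08d}.{segment_no}"


print(format_road_number(1))        # RD00000001WCK
print(format_segment_number(1, 1))  # 00000001.1
print(format_segment_number(1, 2))  # 00000001.2
```

If the numbering scheme changes later, only these functions change; the stored keys and every foreign key that points at them stay as they are.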
First off, option 2 is the absolute worst option. As an Index, it's a string, and that makes it slow. And it's generated based on business rules - which can change and cause a rather large headache.
Personally, I always use a separate primary key column; and I always use a GUID. Some developers prefer a simple INT over a GUID for reasons of hard-drive space. However, if the situation arises where you need to merge two databases, GUIDs will almost never collide (whereas auto-incremented INTs that both start from 1 are all but guaranteed to collide).
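To make the merge argument concrete, here is a small sketch using Python's sqlite3 and uuid modules. The customer table and the helper names are made up for illustration; two independent in-memory databases stand in for the two systems being merged.

```python
import sqlite3
import uuid


def new_customer_db(names, use_guid):
    """Create an in-memory database and insert one customer row per name."""
    db = sqlite3.connect(":memory:")
    if use_guid:
        db.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
        for name in names:
            db.execute(
                "INSERT INTO customer (id, name) VALUES (?, ?)",
                (str(uuid.uuid4()), name),
            )
    else:
        db.execute(
            "CREATE TABLE customer ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
        )
        for name in names:
            db.execute("INSERT INTO customer (name) VALUES (?)", (name,))
    return db


def ids(db):
    """Return the set of primary key values in the customer table."""
    return {row[0] for row in db.execute("SELECT id FROM customer")}


# Keys present in BOTH databases would clash on merge.
int_clash = ids(new_customer_db(["Ann", "Bob"], False)) & ids(new_customer_db(["Cy", "Di"], False))
guid_clash = ids(new_customer_db(["Ann", "Bob"], True)) & ids(new_customer_db(["Cy", "Di"], True))

print(sorted(int_clash))  # [1, 2]
print(len(guid_clash))    # 0
```

With integer keys every row needs renumbering (and so does every foreign key pointing at it) before the merge; with GUIDs the rows can simply be copied across.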
Primary Keys should NEVER be seen by the user. Making it readable to the user should not be a concern. Primary Keys SHOULD be used to link with Foreign Keys. This is their purpose. The value should be machine readable and, once created, never changed.

Surrogate vs. natural/business keys [closed]

Here we go again, the old argument still arises...
Is it better to have a business key as the primary key, or a surrogate id (e.g. an SQL Server identity) with a unique constraint on the business key field?
Please, provide examples or proof to support your theory.
Just a few reasons for using surrogate keys:
Stability: Changing a key because of a business or natural need will negatively affect related tables. Surrogate keys rarely, if ever, need to be changed because there is no meaning tied to the value.
Convention: Allows you to have a standardized Primary Key column naming convention rather than having to think about how to join tables with various names for their PKs.
Speed: Depending on the PK value and type, a surrogate key of an integer may be smaller, faster to index and search.
Both. Have your cake and eat it.
Remember there is nothing special about a primary key, except that it is labelled as such. It is nothing more than a NOT NULL UNIQUE constraint, and a table can have more than one.
If you use a surrogate key, you still want a business key to ensure uniqueness according to the business rules.
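A rough sketch of the "both" approach, run through Python's sqlite3. The employee table and its columns are hypothetical: the surrogate id is the primary key, and the business key keeps its own NOT NULL UNIQUE constraint so duplicates are still rejected.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE employee (
        id          INTEGER PRIMARY KEY,   -- surrogate key, target of foreign keys
        employee_no TEXT NOT NULL UNIQUE,  -- business key, still enforced
        full_name   TEXT NOT NULL
    )
""")
db.execute("INSERT INTO employee (employee_no, full_name) VALUES ('E-100', 'Ann Lee')")

# A second row with the same business key gets a fresh surrogate id,
# so only the UNIQUE constraint can stop it.
try:
    db.execute("INSERT INTO employee (employee_no, full_name) VALUES ('E-100', 'Ann Lee')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = db.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```

Drop the UNIQUE constraint and the second insert goes through, which is exactly the duplicate-data problem people raise against surrogate-only designs.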
It appears that no one has yet said anything in support of non-surrogate (I hesitate to say "natural") keys. So here goes...
A disadvantage of surrogate keys is that they are meaningless (cited as an advantage by some, but...). This sometimes forces you to join a lot more tables into your query than should really be necessary. Compare:
select sum(t.hours)
from timesheets t
where t.dept_code = 'HR'
and t.status = 'VALID'
and t.project_code = 'MYPROJECT'
and t.task = 'BUILD';
against:
select sum(t.hours)
from timesheets t
join departments d on d.dept_id = t.dept_id
join timesheet_statuses s on s.status_id = t.status_id
join projects p on p.project_id = t.project_id
join tasks k on k.task_id = t.task_id
where d.dept_code = 'HR'
and s.status = 'VALID'
and p.project_code = 'MYPROJECT'
and k.task_code = 'BUILD';
Unless anyone seriously thinks the following is a good idea?:
select sum(t.hours)
from timesheets t
where t.dept_id = 34394
and t.status_id = 89
and t.project_id = 1253
and t.task_id = 77;
"But" someone will say, "what happens when the code for MYPROJECT or VALID or HR changes?" To which my answer would be: "why would you need to change it?" These aren't "natural" keys in the sense that some outside body is going to legislate that henceforth 'VALID' should be re-coded as 'GOOD'. Only a small percentage of "natural" keys really fall into that category - SSN and Zip code being the usual examples. I would definitely use a meaningless numeric key for tables like Person, Address - but not for everything, which for some reason most people here seem to advocate.
See also: my answer to another question
A surrogate key will NEVER have a reason to change. I cannot say the same about natural keys. Last names, emails, ISBN numbers - they all can change one day.
Surrogate keys (typically integers) have the added value of making your table relations faster and more economical in storage and update speed (even better, foreign keys do not need to be updated when using surrogate keys, in contrast with business key fields, which do change now and then).
A table's primary key should be used for uniquely identifying the row, mainly for join purposes. Think of a Persons table: names can change, and they're not guaranteed to be unique.
Think Companies: you're a happy Merkin company doing business with other companies in Merkia. You are clever enough not to use the company name as the primary key, so you use Merkia's government's unique company ID in its entirety of 10 alphanumeric characters.
Then Merkia changes the company IDs because they thought it would be a good idea. It's ok, you use your db engine's cascaded updates feature, for a change that shouldn't involve you in the first place. Later on, your business expands, and now you work with a company in Freedonia. Freedonian company IDs are up to 16 characters. You need to enlarge the company id primary key (also the foreign key fields in Orders, Issues, MoneyTransfers etc), adding a Country field in the primary key (also in the foreign keys). Ouch! Civil war in Freedonia, it splits into three countries. The country name of your associate should be changed to the new one; cascaded updates to the rescue. BTW, what's your primary key? (Country, CompanyID) or (CompanyID, Country)? The latter helps joins, the former avoids another index (or perhaps many, should you want your Orders grouped by country too).
All these are not proof, but an indication that a surrogate key to uniquely identify a row for all uses, including join operations, is preferable to a business key.
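For anyone who hasn't used cascaded updates, here is a small sketch of what the Merkia re-numbering looks like when the government id is the primary key. It uses SQLite through Python's sqlite3; the table layout and the id values are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
db.executescript("""
    CREATE TABLE company (
        company_id TEXT PRIMARY KEY,   -- government-issued id used as the key
        name       TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_no   INTEGER PRIMARY KEY,
        company_id TEXT NOT NULL
                   REFERENCES company (company_id) ON UPDATE CASCADE
    );
    INSERT INTO company VALUES ('MK00000001', 'Acme');
    INSERT INTO orders VALUES (1, 'MK00000001'), (2, 'MK00000001');
""")

# The government re-issues the id: one UPDATE on the parent row...
db.execute("UPDATE company SET company_id = 'MK-2024-0001' WHERE company_id = 'MK00000001'")

# ...and every referencing row is rewritten by the engine.
order_company_ids = [
    row[0] for row in db.execute("SELECT company_id FROM orders ORDER BY order_no")
]
print(order_company_ids)  # ['MK-2024-0001', 'MK-2024-0001']
```

It works, but every child row is touched for a change that has nothing to do with your own data. With a surrogate key the same event is a one-row update to an ordinary column.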
I hate surrogate keys in general. They should only be used when there is no quality natural key available. It is rather absurd when you think about it, to think that adding meaningless data to your table could make things better.
Here are my reasons:
When using natural keys, tables are clustered in the way that they are most often searched thus making queries faster.
When using surrogate keys you must add unique indexes on logical key columns. You still need to prevent logical duplicate data. For example, you can’t allow two Organizations with the same name in your Organization table even though the pk is a surrogate id column.
When surrogate keys are used as the primary key it is much less clear what the natural primary keys are. When developing you want to know what set of columns make the table unique.
In one to many relationship chains, the logical key chains. So for example, Organizations have many Accounts and Accounts have many Invoices. So the logical-key of Organization is OrgName. The logical-key of Accounts is OrgName, AccountID. The logical-key of Invoice is OrgName, AccountID, InvoiceNumber.
When surrogate keys are used, the key chains are truncated by only having a foreign key to the immediate parent. For example, the Invoice table does not have an OrgName column. It only has a column for the AccountID. If you want to search for invoices for a given organization, then you will need to join the Organization, Account, and Invoice tables. If you use logical keys, then you could query the Invoice table directly by OrgName.
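A sketch of the chained logical keys described above, in SQLite via Python. The snake_case names and the amount column are my additions for the demo. Each child carries the full key of its parent, so invoices can be filtered by organization with no joins.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
    CREATE TABLE organization (org_name TEXT PRIMARY KEY);
    CREATE TABLE account (
        org_name   TEXT NOT NULL REFERENCES organization (org_name),
        account_id INTEGER NOT NULL,
        PRIMARY KEY (org_name, account_id)
    );
    CREATE TABLE invoice (
        org_name       TEXT NOT NULL,
        account_id     INTEGER NOT NULL,
        invoice_number INTEGER NOT NULL,
        amount         INTEGER NOT NULL,   -- hypothetical payload column
        PRIMARY KEY (org_name, account_id, invoice_number),
        FOREIGN KEY (org_name, account_id)
            REFERENCES account (org_name, account_id)
    );
    INSERT INTO organization VALUES ('Acme'), ('Globex');
    INSERT INTO account VALUES ('Acme', 1), ('Globex', 1);
    INSERT INTO invoice VALUES
        ('Acme', 1, 1001, 50),
        ('Acme', 1, 1002, 70),
        ('Globex', 1, 1001, 10);
""")

# No join needed: the organization's key is already on the invoice row.
acme_total = db.execute(
    "SELECT SUM(amount) FROM invoice WHERE org_name = 'Acme'"
).fetchone()[0]
print(acme_total)  # 120
```

The trade-off is that the key gets wider at every level of the chain, and a change to an organization's name has to cascade all the way down.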
Storing surrogate key values of lookup tables causes tables to be filled with meaningless integers. To view the data, complex views must be created that join to all of the lookup tables. A lookup table is meant to hold a set of acceptable values for a column. It should not be codified by storing an integer surrogate key instead. There is nothing in the normalization rules that suggest that you should store a surrogate integer instead of the value itself.
I have three different database books. Not one of them shows using surrogate keys.
I want to share my experience with you on this endless war :D on natural vs surrogate key dilemma. I think that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. So depending on your situation, it might be more relevant to choose one method or the other.
As it seems that many people present surrogate keys as the almost perfect solution and natural keys as the plague, I will focus on the other point of view's arguments:
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns, which means:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo-like algorithms, but those methods still have their own drawbacks).
If one day you need to move your data from one schema to another (it happens quite regularly in my company at least) then you might encounter Id collision problems. And yes, I know that you can use UUIDs, but those require 32 hexadecimal digits! (If you care about database size then it can be an issue.)
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit so - as a developer - you have to pay attention to the following points:
You must cycle your sequence (when the max-value is reached it goes back to 1,2,...).
If you are using the sequence as an ordering (over time) of your data then you must handle the case of cycling (the row with Id 1 might be newer than the row with Id max-value - 1).
Make sure that your code (and even your client interfaces, which should not happen as it is supposed to be an internal Id) supports the 32-bit/64-bit integers that you use to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
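Since the round-trip point above mentions hi/lo-style algorithms, here is a rough sketch of the idea in Python. It is not any particular ORM's implementation. The class name, the block size and the fetch_next_hi stand-in for a database sequence are all assumptions made for the example.

```python
import itertools


class HiLoGenerator:
    """Hand out ids locally, asking the database for a new block only when needed."""

    def __init__(self, fetch_next_hi, block_size=100):
        self._fetch_next_hi = fetch_next_hi  # callable: costs one database round-trip
        self._block_size = block_size
        self._hi = None
        self._lo = block_size  # forces a fetch on first use

    def next_id(self):
        if self._lo >= self._block_size:
            # Current block is used up: reserve the next one.
            self._hi = self._fetch_next_hi()
            self._lo = 0
        value = self._hi * self._block_size + self._lo
        self._lo += 1
        return value


# Stand-in for a database sequence; counts how often it is consulted.
stats = {"round_trips": 0}
_sequence = itertools.count(0)


def fetch_next_hi():
    stats["round_trips"] += 1
    return next(_sequence)


gen = HiLoGenerator(fetch_next_hi, block_size=100)
ids = [gen.next_id() for _ in range(250)]
print(ids[0], ids[-1], stats["round_trips"])  # 0 249 3
```

250 ids cost three round-trips instead of 250. The drawback hinted at above still applies: a client that stops halfway through a block leaves a gap in the id range.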
Myths on natural keys
Composite keys are less efficient than surrogate keys. No! It depends on the database engine used:
Oracle
MySQL
Natural keys don't exist in real-life. Sorry but they do exist! In aviation industry, for example, the following tuple will be always unique regarding a given scheduled flight (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Conclusion
Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
Always use a key that has no business meaning. It's just good practice.
EDIT: I was trying to find a link to it online, but I couldn't. However, 'Patterns of Enterprise Application Architecture' [Fowler] has a good explanation of why you shouldn't use anything other than a key with no meaning other than being a key. It boils down to the fact that it should have one job and one job only.
Surrogate keys are quite handy if you plan to use an ORM tool to handle/generate your data classes. While you can use composite keys with some of the more advanced mappers (read: Hibernate), it adds some complexity to your code.
(Of course, database purists will argue that even the notion of a surrogate key is an abomination.)
I'm a fan of using uids for surrogate keys when suitable. The major win with them is that you know the key in advance e.g. you can create an instance of a class with the ID already set and guaranteed to be unique whereas with, say, an integer key you'll need to default to 0 or -1 and update to an appropriate value when you save/update.
UIDs have penalties in terms of lookup and join speed though so it depends on the application in question as to whether they're desirable.
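A minimal sketch of the "key known in advance" point, using Python's uuid and dataclasses modules. The Customer and Order classes are hypothetical. The related objects are wired together by id before anything is saved.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Customer:
    name: str
    # Generated on the client, before anything touches the database.
    id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Order:
    customer_id: uuid.UUID
    total: int
    id: uuid.UUID = field(default_factory=uuid.uuid4)


customer = Customer(name="Ann")
# The foreign key value is already final; no placeholder 0 or -1 to patch later.
order = Order(customer_id=customer.id, total=120)
print(order.customer_id == customer.id)  # True
```

With a database-generated integer you would have to insert the customer first, read the id back, and only then build the order.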
Using a surrogate key is better in my opinion as there is zero chance of it changing. Almost anything I can think of which you might use as a natural key could change (disclaimer: not always true, but commonly).
An example might be a DB of cars - at first glance, you might think that the licence plate could be used as the key. But these could be changed so that'd be a bad idea. You wouldn't really want to find that out after releasing the app, when someone comes to you wanting to know why they can't change their number plate to their shiny new personalised one.
Always use a single column, surrogate key if at all possible. This makes joins as well as inserts/updates/deletes much cleaner because you're only responsible for tracking a single piece of information to maintain the record.
Then, as needed, stack your business keys as unique constraints or indexes. This will keep your data integrity intact.
Business logic/natural keys can change, but the physical key of a table should NEVER change.
Case 1: Your table is a lookup table with less than 50 records (50 types)
In this case, use manually named keys, according to the meaning of each record.
For Example:
Table: JOB with 50 records
CODE (primary key)   NAME         DESCRIPTION
PRG                  PROGRAMMER   A programmer is writing code
MNG                  MANAGER      A manager is doing whatever
CLN                  CLEANER      A cleaner cleans
...
joined with
Table: PEOPLE with 100000 inserts
foreign key JOBCODE in table PEOPLE
looks at
primary key CODE in table JOB
Case 2: Your table is a table with thousands of records
Use surrogate/autoincrement keys.
For Example:
Table: ASSIGNMENT with 1000000 records
joined with
Table: PEOPLE with 100000 records
foreign key PEOPLEID in table ASSIGNMENT
looks at
primary key ID in table PEOPLE (autoincrement)
In the first case:
You can select all programmers in table PEOPLE without use of join with table JOB, but just with: SELECT * FROM PEOPLE WHERE JOBCODE = 'PRG'
In the second case:
Your database queries are faster because your primary key is an integer
You don't need to bother yourself with finding the next unique key because the database itself gives you the next autoincrement.
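The two cases can be combined in one schema. Below is a sketch in SQLite via Python: JOB uses a hand-picked code as its key (case 1), PEOPLE uses an autoincrement id (case 2). The sample people are made up; the table and column names follow the example above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to check JOBCODE values
db.executescript("""
    CREATE TABLE job (
        code TEXT PRIMARY KEY,   -- case 1: small lookup table, meaningful key
        name TEXT NOT NULL
    );
    CREATE TABLE people (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,   -- case 2: large table
        name    TEXT NOT NULL,
        jobcode TEXT NOT NULL REFERENCES job (code)
    );
    INSERT INTO job VALUES ('PRG', 'PROGRAMMER'), ('MNG', 'MANAGER'), ('CLN', 'CLEANER');
    INSERT INTO people (name, jobcode) VALUES ('Ann', 'PRG'), ('Bob', 'MNG'), ('Cy', 'PRG');
""")

# All programmers, without joining to JOB.
programmers = [
    row[0]
    for row in db.execute("SELECT name FROM people WHERE jobcode = 'PRG' ORDER BY id")
]

# The foreign key still guards against codes that are not in the lookup table.
try:
    db.execute("INSERT INTO people (name, jobcode) VALUES ('Di', 'XYZ')")
    unknown_code_rejected = False
except sqlite3.IntegrityError:
    unknown_code_rejected = True

print(programmers, unknown_code_rejected)  # ['Ann', 'Cy'] True
```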
Surrogate keys can be useful when business information can change or be identical. Business names don't have to be unique across the country, after all. Suppose you deal with two businesses named Smith Electronics, one in Kansas and one in Michigan. You can distinguish them by address, but that'll change. Even the state can change; what if Smith Electronics of Kansas City, Kansas moves across the river to Kansas City, Missouri? There's no obvious way of keeping these businesses distinct with natural key information, so a surrogate key is very useful.
Think of the surrogate key like an ISBN number. Usually, you identify a book by title and author. However, I've got two books titled "Pearl Harbor" by H. P. Willmott, and they're definitely different books, not just different editions. In a case like that, I could refer to the looks of the books, or the earlier versus the later, but it's just as well I have the ISBN to fall back on.
In a data warehouse scenario I believe it is better to follow the surrogate key path. Two reasons:
You are independent of the source system, and changes there --such as a data type change-- won't affect you.
Your DW will need less physical space since you will use only integer data types for your surrogate keys. Also your indexes will work better.
As a reminder, it is not good practice to place clustered indices on random surrogate keys, i.e. GUIDs that read XY8D7-DFD8S, as SQL Server cannot keep such randomly ordered values physically sorted without frequent page splits and fragmentation. You should instead place unique indices on these data, though it may also be beneficial to simply run SQL Profiler for the main table operations and then place those data into the Database Engine Tuning Advisor.
See thread # http://social.msdn.microsoft.com/Forums/en-us/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be
This is one of those cases where a surrogate key pretty much always makes sense. There are cases where you either choose what's best for the database or what's best for your object model, but in both cases, using a meaningless key or GUID is a better idea. It makes indexing easier and faster, and it is an identity for your object that doesn't change.
In the case of a point-in-time database it is best to have a combination of surrogate and natural keys. E.g. you need to track member information for a club. Some attributes of a member never change, e.g. date of birth, but the name can change.
So create a Member table with a member_id surrogate key and have a column for DOB.
Create another table called person name and have columns for member_id, member_fname, member_lname, date_updated. In this table the natural key would be member_id + date_updated.
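Here is a sketch of that layout in SQLite via Python, plus the "as of" lookup it makes possible. The table name person_name, the sample rows and the name_as_of helper are my own illustration, and dates are stored as ISO text so they compare correctly as strings.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE member (
        member_id INTEGER PRIMARY KEY,   -- surrogate key
        dob       TEXT NOT NULL          -- never changes
    );
    CREATE TABLE person_name (
        member_id    INTEGER NOT NULL REFERENCES member (member_id),
        member_fname TEXT NOT NULL,
        member_lname TEXT NOT NULL,
        date_updated TEXT NOT NULL,      -- ISO dates sort correctly as text
        PRIMARY KEY (member_id, date_updated)
    );
    INSERT INTO member VALUES (1, '1980-05-17');
    INSERT INTO person_name VALUES
        (1, 'Ann', 'Lee',   '2001-01-01'),
        (1, 'Ann', 'Smith', '2010-06-15');
""")


def name_as_of(member_id, as_of):
    """Return the (first, last) name in effect on the given ISO date, or None."""
    return db.execute(
        """
        SELECT member_fname, member_lname
        FROM person_name
        WHERE member_id = ? AND date_updated <= ?
        ORDER BY date_updated DESC
        LIMIT 1
        """,
        (member_id, as_of),
    ).fetchone()


print(name_as_of(1, "2005-03-01"))  # ('Ann', 'Lee')
print(name_as_of(1, "2020-01-01"))  # ('Ann', 'Smith')
```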
Horses for courses. To state my bias: I'm a developer first, so I'm mainly concerned with giving the users a working application.
I've worked on systems with natural keys, and had to spend a lot of time making sure that value changes would ripple through.
I've worked on systems with only surrogate keys, and the only drawback has been a lack of denormalised data for partitioning.
Most traditional PL/SQL developers I have worked with didn't like surrogate keys because of the number of tables per join, but our test and production databases never raised a sweat; the extra joins didn't affect the application performance. With database dialects that don't support clauses like "X inner join Y on X.a = Y.b", or developers who don't use that syntax, the extra joins for surrogate keys do make the queries harder to read, and longer to type and check: see Tony Andrews' post. But if you use an ORM or any other SQL-generation framework you won't notice it. Touch-typing also mitigates.
Maybe not completely relevant to this topic, but here is a headache I have dealing with surrogate keys. Oracle pre-delivered analytics creates auto-generated SKs on all of its dimension tables in the warehouse, and it also stores those on the facts. So, anytime they (dimensions) need to be reloaded as new columns are added or need to be populated for all items in the dimension, the SKs assigned during the update put the SKs out of sync with the original values stored on the fact, forcing a complete reload of all fact tables that join to it. I would prefer that even if the SK was a meaningless number, there would be some way that it could not change for original/old records. As many know, out-of-the-box rarely serves an organization's needs, and we have to customize constantly. We now have 3 years' worth of data in our warehouse, and complete reloads from the Oracle Financial systems are very large. So in my case, they are not generated from data entry, but added in a warehouse to help reporting performance. I get it, but ours do change, and it's a nightmare.
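One generic way to get SKs that "could not change for original/old records" is to keep a persistent natural-key-to-SK map and only hand out new numbers for unseen keys. The sketch below is plain Python and is not how the Oracle pre-delivered product works. The function name and the ACC-n keys are invented for the example.

```python
def assign_surrogate_keys(key_map, natural_keys):
    """Extend key_map in place so every natural key has a surrogate key.

    Existing entries are never renumbered; only unseen natural keys
    receive the next free integer.
    """
    next_sk = max(key_map.values(), default=0) + 1
    for natural_key in natural_keys:
        if natural_key not in key_map:
            key_map[natural_key] = next_sk
            next_sk += 1
    return key_map


# Initial load of the dimension.
key_map = assign_surrogate_keys({}, ["ACC-1", "ACC-2"])
# Full reload with one new member, in a different order: old keys stay put.
key_map = assign_surrogate_keys(key_map, ["ACC-2", "ACC-3", "ACC-1"])
print(key_map)  # {'ACC-1': 1, 'ACC-2': 2, 'ACC-3': 3}
```

Because the SKs already stored on the fact rows still point at the right dimension members after the reload, the facts would not need to be reloaded with them.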

Resources