default value vs. null for foreign key - sql-server

I have a question about using NULL vs. a default value for foreign key columns in a database. I have found many opposing opinions about NULL vs. default values in database design, but not much that deals specifically with foreign keys (what the main pros and cons are).
Currently I'm designing a new database which will store a lot of data for different web applications and other systems with different data access approaches (ORM, stored procedures), and I want to enforce general rules at the lowest possible level (the database), so that I don't have to worry about these rules later in the applications.
To give you an example, let's say I have a User table with a foreign key column NationalityID for the user's nationality, which references the primary key CountryID of the Country table.
Now I have two/three options:
A: I allow the NationalityID column (and all similar foreign key columns in the database) to be NULL and stick with the common approach of checking for NULL always and everywhere (applying the rules in the application)
or
B: I assign a default value of, say, -1 to every foreign key, and in every referenced table I add a row with -1 as the key and "No data" for all other columns (for this example, I put a row in the Country table with a CountryID of -1 and a CountryName of "No data"). Then every time I want to know a user's nationality I always get a result without additional code rules (no need for me to check whether it's NULL or not). See the sketch after these options.
or
C: I disallow NULL values for foreign keys. But this is really something I want to avoid. (I need the option to store at least the basic data (the user's name) even when the additional data (the user's nationality) is missing.)
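To make option B concrete, here is roughly what it could look like in SQL Server (a sketch only; table and constraint names follow the example above):
-- The "No data" sentinel row in the referenced table:
INSERT INTO Country (CountryID, CountryName) VALUES (-1, 'No data');
-- The foreign key column gets the sentinel as its default instead of NULL:
ALTER TABLE [User]
ADD CONSTRAINT DF_User_NationalityID DEFAULT (-1) FOR NationalityID;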
So is B a good approach or not? What am I missing here? Do I lose more than I gain with this approach? Which problems could I run into (apart from having to be careful that every referenced table has the additional row with an ID of -1 meaning "No data")?
What is your good/bad experience with foreign key default values?
Thank you.

If you normalize this won't be an issue.
Instead of putting nationality in the USER table, make a User_Nationality table that links users to Country_ID in the other table.
If they have an entry in that lookup table, great. If not, you don't need to store a NULL or default value for it.
You need to enforce FK relationships, and allowing NULL goes against that. You also don't want to make up information that may not be accurate just to populate a field, which negates the point of requiring the field in the first place.
Use lookup tables and you can bypass that entirely.
This will also allow you to change your mind and choose one of your options down the road.
If you use views, you can choose to treat missing data as a NULL or a default value without needing to alter the underlying data.
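A minimal sketch of this design, using the User/Country example from the question (table and column names are illustrative):
-- Link table: a row exists only when the nationality is known
CREATE TABLE User_Nationality
(
UserID INT NOT NULL PRIMARY KEY REFERENCES [User](UserID),
CountryID INT NOT NULL REFERENCES Country(CountryID)
);

-- A view can present missing data as NULL or as a default, as preferred:
CREATE VIEW UserWithNationality AS
SELECT u.UserID, u.Name,
COALESCE(c.CountryName, 'No data') AS CountryName
FROM [User] u
LEFT JOIN User_Nationality un ON un.UserID = u.UserID
LEFT JOIN Country c ON c.CountryID = un.CountryID;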

Personally, I feel that even if you have a placeholder entry in your database with a key of -1, you would still be performing a check to decide whether to display 'No Data' for each individual field.
I would stick to NULLs. NULL is meant to mean the absence of data, which is the case here.

B is a terrible approach. It is easier to remember to handle NULLs than to have to figure out what magic number you used - and then you still have to handle it anyway. Use option A. But I like JNK's idea best.

I suggest option D. If not all users have a defined nationality then that information doesn't belong in the user table. Create a table called UserNationality keyed on UserId.

I like your B solution. It may be possible to map the values onto other entities, so you have Country, plus a NullCountry that extends Country, is mapped to the row with id = -1, and has special code in its methods to make the special cases easy to handle.
One problem is probably that it will be harder to do outer joins on that foreign key.
EDIT: no, there should be no problem with outer joins, because there would be no need to do outer joins.

Related

Should the contents of a column acting as a primary key be interpretable or purely unique integers

I have the luxury of designing a database from scratch. When designing columns to act as unique keys, should I just use unique integers, or should I attempt to make the values interpretable? So if I had a lookup table of ward names in a hospital, should the id column contain unique codes that in some way relate to the name of the ward, or just unique integers?
Resist the temptation to overload the id values with meaning. Use other attributes to store the info you're considering stuffing into the id.
Overloading the id with "meaning" is bad because:
If the data being stuffed into the ID changes, so must your ID. IDs should never change.
If the data type of the data changes, you'll have a problem, for example:
If your ID is numeric, and the stuffed info changes from numeric to text, you'll have big problems
If the stuffed data changes from a simple field to a one-to-many child, your model will break
What you believe has "important" meaning now may not be important in the future. Then your "specially encoded" data will become useless and a burden, even a serious restriction
What currently "identifies" a product may change as the business evolves
I have seen this idea attempted many times, never successfully. In every case, the idea was scrapped and surrogate IDs were introduced to replace the magic IDs, with all the risk and development cost associated with that task.
In my career, I have seen most of the problems listed above actually happen.
You should not be using a lookup table. Make your tables InnoDB and use referential integrity to join tables together. Your id columns should always be set as primary keys and should be set to auto-increment. Never try to make up your own ids. You should really look at a tutorial on referential integrity and learn how to associate tables with other tables.

NULL permitted in Primary Key - why and in which DBMS?

Further to my question "Why to use 'not null primary key' in TSQL?"...
As I understood from other discussions, some RDBMS (for example SQLite, MySQL) permit "unique" NULL in the primary key.
Why is this allowed and how might it be useful?
Background: I believe it is beneficial for communication with colleagues and database professionals to know the differences in fundamental concepts, approaches and their implementations in different DBMS.
Notes
MySQL is rehabilitated and returned to the "NOT NULL PK" list.
SQLite has been added (thanks to Paul Hadfield) to "NULL PK" list:
For the purposes of determining the uniqueness of primary key values, NULL values are considered distinct from all other values, including other NULLs.
If an INSERT or UPDATE statement attempts to modify the table content so that two or more rows feature identical primary key values, it is a constraint violation. According to the SQL standard, PRIMARY KEY should always imply NOT NULL. Unfortunately, due to a long-standing coding oversight, this is not the case in SQLite.
Unless the column is an INTEGER PRIMARY KEY SQLite allows NULL values in a PRIMARY KEY column. We could change SQLite to conform to the standard (and we might do so in the future), but by the time the oversight was discovered, SQLite was in such wide use that we feared breaking legacy code if we fixed the problem.
So for now we have chosen to continue allowing NULLs in PRIMARY KEY columns. Developers should be aware, however, that we may change SQLite to conform to the SQL standard in future and should design new programs accordingly.
— SQL As Understood By SQLite: CREATE TABLE
Suppose you have a primary key containing a nullable column Kn.
If you want to have a second row rejected on the ground that in that second row, Kn is null and the table already contains a row with Kn null, then you are actually requiring that the system would treat the comparison "row1.Kn = row2.Kn" as giving TRUE (because you somehow want the system to detect that the key values in those rows are indeed equal). However, this comparison boils down to the comparison "null = null", and the standard already explicitly specifies that null doesn't compare equal to anything, including itself.
To allow for what you want, would thus amount to SQL deviating from its own principles regarding the treatment of null. There are innumerable inconsistencies in SQL, but this particular one never got past the committee.
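The behaviour is easy to demonstrate in any SQL dialect (a quick illustration):
-- NULL = NULL yields UNKNOWN, not TRUE, so the ELSE branch is taken:
SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'not provably equal' END;
-- returns 'not provably equal'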
I don't know whether older versions of MySQL differ on this, but as of modern versions a primary key must be on columns that are not null. See the manual page on CREATE TABLE: "A PRIMARY KEY is a unique index where all key columns must be defined as NOT NULL. If they are not explicitly declared as NOT NULL, MySQL declares them so implicitly (and silently)."
As far as relational database theory is concerned:
The primary key of a table is used to uniquely identify each and every row in the table
A NULL value in a column indicates that you don't know what the value is
Therefore, you should never use the value of "I don't know" to uniquely identify a row in a table.
Depending upon the data you are modelling, a "made up" value can be used instead of NULL. I've used 0, "N/A", 'Jan 1, 1980', and similar values to represent dummy "known to be missing" data.
Most, if not all, DB engines do allow for a UNIQUE constraint or index, which does allow for NULL column values, though (ideally) only one row may be assigned the value null (otherwise it wouldn't be a unique value). This can be used to support the irritatingly pragmatic (but occasionally necessary) situations that don't fit neatly into relational theory.
Well, it could allow you to implement the Null Object Pattern natively within the database. So if you were using something similar in code, which interacted very closely with the DB, you could just look up the object corresponding to the key without having to special-case a null check.
Now whether this is worthwhile functionality I'm not sure, but it's really a question of whether the pros of disallowing null pkeys in absolutely all cases outweigh the cons of obstructing someone who (for better or worse) actually wants to use null keys. This would only be worth it if you could demonstrate some non-trivial improvement (such as faster key lookup) from being able to guarantee that keys are non-null. Some DB engines would show this, others might not. And if there aren't any real pros from forcing this, why artificially restrict your clients?
As discussed in other answers, NULL was intended to mean "the information that should go in this column is unknown". However, it is also frequently used to indicate an alternative meaning of "this attribute does not exist". This is a particularly useful interpretation when looking at timestamp fields that are interpreted as the time some particular event occurred, in which case NULL is often used to indicate that the event has not yet occurred.
It is a problem that SQL doesn't support this interpretation very well -- for this to work properly, it really needs a separate value (something like "never") that doesn't behave as null does ("never" should be equal to "never" and should compare as higher than all other values). But as SQL lacks this notion, and there is no convenient way to add it, using null for this purpose is often the best choice.
This leaves a problem when the timestamp of an event that may not have occurred should be part of the primary key of a table. A common example is using a natural key together with a deletion timestamp when soft deletion is used and the item must be recreatable after deletion. In that case you really want the primary key to have a nullable column. Alas, this is not allowed in most databases, and instead you have to resort to an artificial primary key (e.g. a row sequence number) and a UNIQUE constraint for what should otherwise have been your actual primary key.
An example scenario, in order to clarify this: I have a users table. As I require each user to have a distinct username, I decide to use username as the primary key. I want to support user deletion, but as I need to track the existence of users historically for auditing purposes I use soft deletion (in the first version of the schema, I add a 'deleted' flag to the user, and ensure that the deleted flag is checked in all queries where only active users are expected).
An additional requirement, however, is that if a username is deleted, it should be available for new users to register. An attractive way to achieve this would be to have the deleted flag change to a nullable timestamp (where nulls indicate that the user has not been deleted) and put this in the primary key. Were primary keys to allow nullable columns, this would have the following effect:
Creating a new user with an existing username when that user's deleted column is null would be denied as a duplicate key entry
Deleting a user changes its key (which requires changes to cascade to foreign keys that reference the user, which is suboptimal, but acceptable if deletions are rare), so that the deleted column becomes a timestamp for when the deletion occurred
Now a new user (which would have a null deleted timestamp) can be successfully created.
However, this cannot actually be achieved with standard SQL, so instead one must use a different primary key (probably a generated numeric user id in this case) and use a UNIQUE constraint to enforce the uniqueness of (username,deleted).
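A sketch of that workaround in SQL Server syntax (names are illustrative). One caveat: how UNIQUE constraints treat NULLs varies by DBMS; SQL Server treats two NULLs as duplicates, which here happens to enforce the desired "at most one active row per username", while engines that allow multiple NULLs in a unique index would need a filtered or partial unique index instead.
CREATE TABLE users
(
user_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- artificial key
username VARCHAR(50) NOT NULL,
deleted DATETIME NULL, -- NULL = not deleted yet
CONSTRAINT UQ_users_username_deleted UNIQUE (username, deleted)
);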
Having a NULL primary key can be beneficial in some scenarios. In one of my projects I used this feature during synchronisation of databases: one on the server and many on different users' devices. Considering the fact that not all users have access to the Internet all the time, I decided that only the main database would be able to assign ids to my entities. SQLite has its own mechanism for numbering rows. Had I used an additional id field, I would have used more bandwidth. Having NULL as the id not only tells me that an entity was created on the client's device while it had no access to the Internet, but also decreases code complexity. The only drawback is that on the client's device I can't get an entity by its id unless it was previously synchronised with the main database. However, that's not an issue, since my users care about entities for their parameters, not their unique ids.

Too many lookup tables

What are the adverse effects of having too many lookup tables in the database?
I have to incorporate many enumerations, depending on the application.
What would experts advise?
Initially you have to ask yourself "how many is too many?". If there is a logical relation between two tables, there has to be a FK.
If you don't need the related tables anywhere else within the database, you could consider removing them and using a CHECK constraint with an IN clause to enforce data validity. However, this means altering the table for each new value in the enumeration.
My personal advice is to keep the FKs and the tables. It's a clear solution and the database is way better to maintain if there is a describing text available for all those numbers.
Let me tell you how awful it is to have too few lookup tables. The original designers at one place I worked decided to put all lookups into one table and define what each lookup was for using a typeid. This caused almost all queries to hit this table to get the lookup's descriptive value, creating a performance bottleneck.
Further, without separate lookups, the fields that took the typeid were not constrained to the values appropriate for that field, because a foreign key can only reference the whole table, not a subset of it. So the field that stored the clientid might accidentally contain the value for a user group. This caused data integrity problems and made reporting much more difficult, as we had to interpret values that didn't make sense in context. There is no prize for using too few tables; in fact, it is often an anti-pattern in database design.
Create 1000 lookup tables if that is what you need.
Like Florian, I much prefer having tons of foreign keys to having CHECK ... IN (...) - for a simple reason: you can simply insert new allowed values as records in the lookup tables.
Maintaining CHECK IN () is a much bigger problem. Imagine this scenario:
CREATE TABLE street
(
id serial not null,
st_type varchar(20) not null,
st_name varchar(100) not null,
constraint street_pk primary key (id),
constraint street_type_check check (st_type in ('STREET','AVENUE','SQUARE'))
);
You have 1000 rows with those types checked, correct? If you need to add another one, you will need to drop the constraint and recreate it.
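For example (a sketch; exact syntax varies slightly by DBMS):
-- allowing a new street type means recreating the constraint:
alter table street drop constraint street_type_check;
alter table street add constraint street_type_check
check (st_type in ('STREET','AVENUE','SQUARE','BOULEVARD'));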
If you take an item off that list, like SQUARE, what will happen to the rows already committed (and checked at the moment of insertion) that have that type? They will still keep a now-invalid type.
Tables and Foreign Keys are easier to maintain and keep track of.
The whole point of lookup data is that there is a finite list of valid identifiers for a specific field. If those fields are used in procedures or WHERE clauses to determine the correct process path or to limit the select list, then there is no such thing as too many lookup tables.
If it is not a finite list of identifiers for a specific process or WHERE clause, then it should not be a lookup value.
Two types of fields come to mind which might be considered lookup values but don't necessarily need to be:
City and Province/State:
There is a finite list of these, but because there are so many you might not want to make a lookup table for them.

How can I have a unique column in many tables

I have ten or more (I don't know exactly how many) tables that have a column named foo with the same datatype.
How can I tell SQL that the values in all these tables should be unique?
I mean, if I have the value "1" in table1, I should NOT be able to have the value "1" in table2.
Have a common IDs table which these ten tables reference. That will work well in that it will ensure unique IDs, but it doesn't mean you couldn't duplicate the IDs across the tables if someone really wants to.
What I mean is that a common IDs table ensures that you don't get duplicates on insert (by also inserting an ID into this common table), but the only way to guarantee it never happens is to build the business rules into the system or place check constraints that cross-reference the other tables (which would ensure uniqueness, but degrade performance).
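A sketch of the common IDs table pattern (table and column names are illustrative):
CREATE TABLE AllFoo
(
foo INT NOT NULL PRIMARY KEY
);

CREATE TABLE Table1
(
foo INT NOT NULL PRIMARY KEY REFERENCES AllFoo(foo)
-- other columns ...
);

-- Each insert goes to AllFoo first; its primary key rejects a foo value
-- already claimed by any other table. Note the FK alone does not stop
-- two tables referencing the same foo, hence the caveat above.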
The question is phrased vaguely; if you need to generate a column that's unique among several tables, use row GUIDs or a common ID generator table; if you need to enforce uniqueness (and the field values are already there), use triggers.
Generally, if you generate the values, you don't need to enforce anything. The generation logic, if done right, will take care of that. If you are inserting, say, user input, then you can and should enforce uniqueness during insertion. As a validation rule or something.
You can define the field as a GUID (or a UNIQUEIDENTIFIER in SQL server). Then it will always be unique no matter what.
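In SQL Server, that could look like this (a sketch; table and column names are illustrative):
CREATE TABLE Table1
(
foo UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY
);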
How about setting a check constraint on each table, such that ID % 10 = N (where N is the table number, from 0-9). And use IDENTITY(N,10) each time.
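A sketch of that scheme for two of the tables (SQL Server syntax; names are illustrative):
-- table0 generates 0, 10, 20, ...; table1 generates 1, 11, 21, ...
CREATE TABLE table0
(
foo INT IDENTITY(0,10) NOT NULL PRIMARY KEY,
CONSTRAINT ck_table0_foo CHECK (foo % 10 = 0)
);

CREATE TABLE table1
(
foo INT IDENTITY(1,10) NOT NULL PRIMARY KEY,
CONSTRAINT ck_table1_foo CHECK (foo % 10 = 1)
);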
I would suggest that possibly your design is flawed. Why are these separate tables? It would be better to put them in one table with one id field and another field to identify whatever is making these separate tables (customer id, for instance). Then you can read about partitioning tables if you want them split by customer for performance reasons.

Surrogate vs. natural/business keys [closed]

Here we go again, the old argument still arises...
Would we better have a business key as a primary key, or would we rather have a surrogate id (i.e. an SQL Server identity) with a unique constraint on the business key field?
Please, provide examples or proof to support your theory.
Just a few reasons for using surrogate keys:
Stability: Changing a key because of a business or natural need will negatively affect related tables. Surrogate keys rarely, if ever, need to be changed because there is no meaning tied to the value.
Convention: Allows you to have a standardized Primary Key column naming convention rather than having to think about how to join tables with various names for their PKs.
Speed: Depending on the PK value and type, a surrogate key of an integer may be smaller, faster to index and search.
Both. Have your cake and eat it.
Remember there is nothing special about a primary key, except that it is labelled as such. It is nothing more than a NOT NULL UNIQUE constraint, and a table can have more than one.
If you use a surrogate key, you still want a business key to ensure uniqueness according to the business rules.
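A minimal sketch of "both" (names are illustrative):
CREATE TABLE Product
(
product_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key used for joins
product_code VARCHAR(20) NOT NULL UNIQUE -- business key, still enforced
);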
It appears that no one has yet said anything in support of non-surrogate (I hesitate to say "natural") keys. So here goes...
A disadvantage of surrogate keys is that they are meaningless (cited as an advantage by some, but...). This sometimes forces you to join a lot more tables into your query than should really be necessary. Compare:
select sum(t.hours)
from timesheets t
where t.dept_code = 'HR'
and t.status = 'VALID'
and t.project_code = 'MYPROJECT'
and t.task = 'BUILD';
against:
select sum(t.hours)
from timesheets t
join departments d on d.dept_id = t.dept_id
join timesheet_statuses s on s.status_id = t.status_id
join projects p on p.project_id = t.project_id
join tasks k on k.task_id = t.task_id
where d.dept_code = 'HR'
and s.status = 'VALID'
and p.project_code = 'MYPROJECT'
and k.task_code = 'BUILD';
Unless anyone seriously thinks the following is a good idea?:
select sum(t.hours)
from timesheets t
where t.dept_id = 34394
and t.status_id = 89
and t.project_id = 1253
and t.task_id = 77;
"But" someone will say, "what happens when the code for MYPROJECT or VALID or HR changes?" To which my answer would be: "why would you need to change it?" These aren't "natural" keys in the sense that some outside body is going to legislate that henceforth 'VALID' should be re-coded as 'GOOD'. Only a small percentage of "natural" keys really fall into that category - SSN and Zip code being the usual examples. I would definitely use a meaningless numeric key for tables like Person, Address - but not for everything, which for some reason most people here seem to advocate.
See also: my answer to another question
A surrogate key will NEVER have a reason to change. I cannot say the same about natural keys. Last names, emails, ISBN numbers - they can all change one day.
Surrogate keys (typically integers) have the added-value of making your table relations faster, and more economic in storage and update speed (even better, foreign keys do not need to be updated when using surrogate keys, in contrast with business key fields, that do change now and then).
A table's primary key should be used for identifying uniquely the row, mainly for join purposes. Think a Persons table: names can change, and they're not guaranteed unique.
Think Companies: you're a happy Merkin company doing business with other companies in Merkia. You are clever enough not to use the company name as the primary key, so you use Merkia's government's unique company ID in its entirety of 10 alphanumeric characters.
Then Merkia changes the company IDs because they thought it would be a good idea. It's ok, you use your db engine's cascaded updates feature, for a change that shouldn't involve you in the first place. Later on, your business expands, and now you work with a company in Freedonia. Freedonian company IDs are up to 16 characters. You need to enlarge the CompanyID primary key (also the foreign key fields in Orders, Issues, MoneyTransfers etc), adding a Country field to the primary key (also to the foreign keys). Ouch! Civil war in Freedonia, and it splits into three countries. The country name of your associate should be changed to the new one; cascaded updates to the rescue. BTW, what's your primary key? (Country, CompanyID) or (CompanyID, Country)? The latter helps joins, the former avoids another index (or perhaps many, should you want your Orders grouped by country too).
All these are not proof, but an indication that a surrogate key to uniquely identify a row for all uses, including join operations, is preferable to a business key.
I hate surrogate keys in general. They should only be used when there is no quality natural key available. It is rather absurd, when you think about it, that adding meaningless data to your table could make things better.
Here are my reasons:
When using natural keys, tables are clustered in the way that they are most often searched thus making queries faster.
When using surrogate keys you must add unique indexes on logical key columns. You still need to prevent logical duplicate data. For example, you can’t allow two Organizations with the same name in your Organization table even though the pk is a surrogate id column.
When surrogate keys are used as the primary key it is much less clear what the natural primary keys are. When developing you want to know what set of columns make the table unique.
In one-to-many relationship chains, the logical keys chain. So, for example, Organizations have many Accounts and Accounts have many Invoices. The logical key of Organization is OrgName. The logical key of Accounts is (OrgName, AccountID). The logical key of Invoice is (OrgName, AccountID, InvoiceNumber).
When surrogate keys are used, the key chains are truncated: each table has only a foreign key to its immediate parent. For example, the Invoice table does not have an OrgName column; it only has a column for the AccountID. If you want to search for invoices for a given organization, you will need to join the Organization, Account, and Invoice tables. If you use logical keys, you can query the Invoice table directly.
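For instance, continuing that example (a sketch; 'ACME' and the OrgID column are illustrative):
-- Logical keys carried down: invoices for an organization, no joins needed
SELECT * FROM Invoice WHERE OrgName = 'ACME';

-- Surrogate keys: the same question needs two joins
SELECT i.*
FROM Invoice i
JOIN Account a ON a.AccountID = i.AccountID
JOIN Organization o ON o.OrgID = a.OrgID
WHERE o.OrgName = 'ACME';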
Storing surrogate key values of lookup tables causes tables to be filled with meaningless integers. To view the data, complex views must be created that join to all of the lookup tables. A lookup table is meant to hold a set of acceptable values for a column. It should not be codified by storing an integer surrogate key instead. There is nothing in the normalization rules that suggest that you should store a surrogate integer instead of the value itself.
I have three different database books. Not one of them shows using surrogate keys.
I want to share my experience with you on this endless war :D - the natural vs. surrogate key dilemma. I think that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. So depending on your situation, it might be more relevant to choose one method or the other.
As it seems that many people present surrogate keys as the almost perfect solution and natural keys as the plague, I will focus on the other point of view's arguments:
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns, which means:
A round-trip to the database each time you want to get a new Id (I know this can be improved by caching or hilo-like algorithms, but those methods have their own drawbacks).
If one day you need to move your data from one schema to another (it happens quite regularly in my company, at least), then you might encounter Id collision problems. And yes, I know you can use UUIDs, but those require 32 hexadecimal digits! (If you care about database size, that can be an issue.)
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit, so - as a developer - you have to pay attention to the following points:
You must cycle your sequence (when the max value is reached, it goes back to 1, 2, ...).
If you are using the sequence as an ordering (over time) of your data, then you must handle the case of cycling (a row with Id 1 might be newer than a row with Id max-value - 1).
Make sure that your code (and even your client interfaces, which should not happen since it is supposed to be an internal Id) supports the 32b/64b integers you used to store your sequence values.
They don't guarantee non-duplicated data. You can always have two rows with all the same column values but different generated key values. For me this is THE problem with surrogate keys from a database design point of view.
More in Wikipedia...
Myths on natural keys
Composite keys are less efficient than surrogate keys. No! It depends on the database engine:
Oracle
MySQL
Natural keys don't exist in real life. Sorry, but they do exist! In the aviation industry, for example, the following tuple will always be unique for a given scheduled flight: (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard, then that set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a four-column primary key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the four columns can be used to query the child table directly (using the four columns in a WHERE clause) without joining to the parent table.
Conclusion
Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
Always use a key that has no business meaning. It's just good practice.
EDIT: I was trying to find a link to it online, but I couldn't. However, 'Patterns of Enterprise Application Architecture' [Fowler] has a good explanation of why you shouldn't use anything other than a key with no meaning beyond being a key. It boils down to the fact that a key should have one job and one job only.
Surrogate keys are quite handy if you plan to use an ORM tool to handle/generate your data classes. While you can use composite keys with some of the more advanced mappers (read: hibernate), it adds some complexity to your code.
(Of course, database purists will argue that even the notion of a surrogate key is an abomination.)
I'm a fan of using UIDs for surrogate keys when suitable. The major win with them is that you know the key in advance, e.g. you can create an instance of a class with the ID already set and guaranteed to be unique, whereas with, say, an integer key you'll need to default to 0 or -1 and update to an appropriate value when you save/update.
UIDs have penalties in terms of lookup and join speed though so it depends on the application in question as to whether they're desirable.
Using a surrogate key is better in my opinion as there is zero chance of it changing. Almost anything I can think of which you might use as a natural key could change (disclaimer: not always true, but commonly).
An example might be a DB of cars - at first glance, you might think that the licence plate could be used as the key. But plates can be changed, so that'd be a bad idea. You wouldn't really want to find that out after releasing the app, when someone comes to you wanting to know why they can't change their number plate to their shiny new personalised one.
Always use a single-column surrogate key if at all possible. This makes joins as well as inserts/updates/deletes much cleaner because you're only responsible for tracking a single piece of information to maintain the record.
Then, as needed, add your business keys as unique constraints or indexes. This will keep your data integrity intact.
Business logic/natural keys can change, but the physical key of a table should NEVER change.
Case 1: Your table is a lookup table with less than 50 records (50 types)
In this case, use manually named keys, according to the meaning of each record.
For Example:
Table: JOB with 50 records
CODE (primary key)   NAME         DESCRIPTION
PRG                  PROGRAMMER   A programmer is writing code
MNG                  MANAGER      A manager is doing whatever
CLN                  CLEANER      A cleaner cleans
...
joined with
Table: PEOPLE with 100000 records
foreign key JOBCODE in table PEOPLE
looks at
primary key CODE in table JOB
Case 2: Your table is a table with thousands of records
Use surrogate/autoincrement keys.
For Example:
Table: ASSIGNMENT with 1000000 records
joined with
Table: PEOPLE with 100000 records
foreign key PEOPLEID in table ASSIGNMENT
looks at
primary key ID in table PEOPLE (autoincrement)
In the first case:
You can select all programmers in table PEOPLE without use of join with table JOB, but just with: SELECT * FROM PEOPLE WHERE JOBCODE = 'PRG'
In the second case:
Your database queries are faster because your primary key is an integer
You don't need to bother yourself with finding the next unique key because the database itself gives you the next autoincrement.
Surrogate keys can be useful when business information can change or be identical. Business names don't have to be unique across the country, after all. Suppose you deal with two businesses named Smith Electronics, one in Kansas and one in Michigan. You can distinguish them by address, but that'll change. Even the state can change; what if Smith Electronics of Kansas City, Kansas moves across the river to Kansas City, Missouri? There's no obvious way of keeping these businesses distinct with natural key information, so a surrogate key is very useful.
Think of the surrogate key like an ISBN number. Usually, you identify a book by title and author. However, I've got two books titled "Pearl Harbor" by H. P. Willmott, and they're definitely different books, not just different editions. In a case like that, I could refer to the looks of the books, or the earlier versus the later, but it's just as well I have the ISBN to fall back on.
In a data warehouse scenario I believe it is better to follow the surrogate key path. Two reasons:
You are independent of the source system, and changes there --such as a data type change-- won't affect you.
Your DW will need less physical space since you will use only integer data types for your surrogate keys. Also your indexes will work better.
As a reminder, it is not good practice to place clustered indexes on random surrogate keys, i.e. GUIDs that read XY8D7-DFD8S, as SQL Server has no ability to physically sort these data. You should instead place unique indexes on these data, though it may also be beneficial to run SQL Profiler on the main table operations and then feed those data into the Database Engine Tuning Advisor.
See this thread: http://social.msdn.microsoft.com/Forums/en-us/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be
This is one of those cases where a surrogate key pretty much always makes sense. There are cases where you either choose what's best for the database or what's best for your object model, but in both cases, using a meaningless key or GUID is a better idea. It makes indexing easier and faster, and it is an identity for your object that doesn't change.
In the case of a point-in-time database it is best to have a combination of surrogate and natural keys. For example, you need to track member information for a club. Some attributes of a member never change, e.g. date of birth, but the name can change.
So create a Member table with a member_id surrogate key and a column for DOB.
Create another table for the member's name (MemberName, say) with columns for member_id, member_fname, member_lname, and date_updated. In this table the natural key would be member_id + date_updated.
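A sketch of that design (SQL Server syntax; names are illustrative):
CREATE TABLE Member
(
member_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
date_of_birth DATE NOT NULL -- attribute that never changes
);

CREATE TABLE MemberName
(
member_id INT NOT NULL REFERENCES Member(member_id),
date_updated DATETIME NOT NULL,
member_fname VARCHAR(50) NOT NULL,
member_lname VARCHAR(50) NOT NULL,
CONSTRAINT PK_MemberName PRIMARY KEY (member_id, date_updated) -- the natural key
);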
Horses for courses. To state my bias: I'm a developer first, so I'm mainly concerned with giving the users a working application.
I've worked on systems with natural keys, and had to spend a lot of time making sure that value changes would ripple through.
I've worked on systems with only surrogate keys, and the only drawback has been a lack of denormalised data for partitioning.
Most traditional PL/SQL developers I have worked with didn't like surrogate keys because of the number of tables per join, but our test and production databases never broke a sweat; the extra joins didn't affect application performance. With database dialects that don't support clauses like "X inner join Y on X.a = Y.b", or with developers who don't use that syntax, the extra joins for surrogate keys do make queries harder to read and longer to type and check: see Tony Andrews' post above. But if you use an ORM or any other SQL-generation framework, you won't notice it. Touch-typing also mitigates.
Maybe not completely relevant to this topic, but here is a headache I have dealing with surrogate keys. Oracle's pre-delivered analytics creates auto-generated SKs on all of its dimension tables in the warehouse, and it also stores those on the facts. So, any time the dimensions need to be reloaded (as new columns are added or need to be populated for all items in the dimension), the SKs assigned during the update go out of sync with the original values stored on the facts, forcing a complete reload of all fact tables that join to them. I would prefer that even if the SK were a meaningless number, there were some way it could not change for original/old records. As many know, out-of-the-box rarely serves an organization's needs, and we have to customize constantly. We now have three years' worth of data in our warehouse, and complete reloads from the Oracle Financial systems are very large. So in my case, the SKs are not generated from data entry, but added in a warehouse to help reporting performance. I get it, but ours do change, and it's a nightmare.
