In one of my database class assignments, I wrote that I specifically didn't assign lengths to my NUMBER columns acting as surrogate keys since it would unnecessarily limit the number of records able to be stored in the table, and because there is literally no difference in performance or physical storage between NUMBER(n) and NUMBER.
My professor wrote back that it would be technically possible but "impractical" for large databases, and that most DBAs in real-life situations would not do that.
There is no difference whatsoever between NUMBER(n) and NUMBER as far as physical storage or performance goes, and thus no reason to specify a length for a NUMBER-based surrogate key column. Why does this professor think that using NUMBER alone would be "impractical"?
In my experience, most production DBAs in real life would likely do as you suggested and declare key columns as NUMBER rather than NUMBER(n).
It would be worthwhile to ask the professor what makes this approach impractical in his or her opinion. There are a couple possibilities that I can think of
Assuming that you are using a data modeling tool to design your schema, a reasonable tool will ensure that the data type of a key will be the same in the table where it is defined as a primary key and in the child table where it is a foreign key. If you specify a length for the primary key, forcing the key to generate foreign keys without length limits would be impractical. Of course, the counter to this is that you can just declare both the primary and foreign key columns as NUMBER.
DBAs tend to be extremely picky (and I mean this as a compliment). They like to see everything organized "just so". Adding a length qualifier to a field whether it be a NUMBER or a VARCHAR2 serves as an implicit constraint that ensure that incorrect data does not get stored. Ideally, you would know when you are designing a table a reasonable upper bound on the number of rows you'll insert over the table's lifetime (i.e. if your PERSON table ended up with more than 10 billion rows, something would likely be seriously wrong). Applying length constraints to numeric columns demonstrates to the DBA that you've done this sort of analysis.
Today, however, that is rather unlikely to actually happen at least with respect to numeric columns both because it is something that is more in keeping with waterfall planning methods that would generally involve that sort of detailed design discussion and because people are less concerned with the growth analysis that would have traditionally been done at the same time. If you were designing a database schema 20 or 30 years ago, it wouldn't be uncommon to provide the DBA with a table-by-table breakdown of the projected size of each table at go-live and over the next few years. Today, it's more cost effective to potentially spend more on disk rather than investing the time to do this analysis up front.
It would probably be better from a readability and self documentation standpoint to limit what can be stored in the column to numbers that are expected. I would agree that I don't see how its impractical
From this thread about number
number(n) is an edit -- restricting the number to n digits in length.
if you store the number '55', it takes the same space in a number(2)
as it does in a number(38).
the storage required = function of the number actually stored.
Left to my own devices I would declare surrogate primary keys as NUMBER(38) on oracle instead of NUMBER. And possibly a check constraint to make the key > 0. Primarily to serve as documentation to outside systems about what they can expect in the column and what they need to be able to handle.
In theory, when building an application that is reading the surrogate primary key, seeing NUMBER means one needs to handle full floating point range of number, whereas NUMBER(38) means the application needs to handle an integer with up to 38 digits.
If I were working in an environment where all the front ends were going to be using a 32 bit integer for surrogate keys I'd define it as a number(10) with appropriate check constraint.
Related
Is it really that bad to use "varchar" as the primary key?
(will be storing user documents, and yes it can exceed 2+ billion documents)
It totally depends on the data. There are plenty of perfectly legitimate cases where you might use a VARCHAR primary key, but if there's even the most remote chance that someone might want to update the column in question at some point in the future, don't use it as a key.
If you are going to be joining to other tables, a varchar, particularly a wide varchar, can be slower than an int.
Additionally if you have many child records and the varchar is something subject to change, cascade updates can causes blocking and delays for all users. A varchar like a car VIN number that will rarely if ever change is fine. A varchar like a name that will change can be a nightmare waiting to happen. PKs should be stable if at all possible.
Next many possible varchar Pks are not really unique and sometimes they appear to be unique (like phone numbers) but can be reused (you give up the number, the phone company reassigns it) and then child records could be attached to the wrong place. So be sure you really have a unique unchanging value before using.
If you do decide to use a surrogate key, then make a unique index for the varchar field. This gets you the benefits of the faster joins and fewer records to update if something changes but maintains the uniquess that you want.
Now if you have no child tables and probaly never will, most of this is moot and adding an integer pk is just a waste of time and space.
I realize I'm a bit late to the party here, but thought it would be helpful to elaborate a bit on previous answers.
It is not always bad to use a VARCHAR() as a primary key, but it almost always is. So far, I have not encountered a time when I couldn't come up with a better fixed size primary key field.
VARCHAR requires more processing than an integer (INT) or a short fixed length char (CHAR) field does.
In addition to storing extra bytes which indicate the "actual" length of the data stored in this field for each record, the database engine must do extra work to calculate the position (in memory) of the starting and ending bytes of the field before each read.
Foreign keys must also use the same data type as the primary key of the referenced parent table, so processing further compounds when joining tables for output.
With a small amount of data, this additional processing is not likely to be noticeable, but as a database grows you will begin to see degradation.
You said you are using a GUID as your key, so you know ahead of time that the column has a fixed length. This is a good time to use a fixed length CHAR(36) field, which incurs far less processing overhead.
I think int or bigint is often better.
int can be compared with less CPU instructions (join querys...)
int sequence is ordered by default -> balanced index tree -> no reorganisation if you use an PK as clustered index
index need potentially less space
Use an ID (this will become handy if you want to show only 50 etc...). Than set a constraint UNIQUE on your varchar with the file-names (I assume, that is what you are storing).
This will do the trick and will increase speed.
I am currently planning to develop a music streaming application. And i am wondering what would be better as a primary key in my tables on the server. An ID int or a Unique String.
Methods 1:
Songs Table:
SongID(int), Title(string), *Artist**(string), Length(int), *Album**(string)
Genre Table
Genre(string), Name(string)
SongGenre:
***SongID****(int), ***Genre****(string)
Method 2
Songs Table:
SongID(int), Title(string), *ArtistID**(int), Length(int), *AlbumID**(int)
Genre Table
GenreID(int), Name(string)
SongGenre:
***SongID****(int), ***GenreID****(int)
Key: Bold = Primary Key, *Field** = Foreign Key
I'm currently designing using method 2 as I believe it will speed up lookup performance and use less space as an int takes a lot less space then a string.
Is there any reason this isn't a good idea? Is there anything I should be aware of?
Is there any reason this isn't a good idea? Is there anything I should be aware of?
Yes. Integer IDs are very bad if you need to uniquely identify the same data outside of a single database. For example, if you have to copy the same data into another database system with potentially pre-existing data or you have a distributed database. The biggest thing to be aware of is that an integer like 7481 has no meaning outside of that one database. If later on you need to grow that database, it may be impossible without surgically removing your data.
The other thing to keep in mind is that integer IDs aren't as flexible so they can't easily be used for exceptional cases. The designers of the Internet Protocol understood this and took precautions by allocating certain blocks of numbers as "special" in one way or another (broadcast IPs, private IPs, network IPs). But that was only possible because there's a protocol surrounding the usage of those numbers. Many databases don't operate within such a well-defined protocol.
FWIW, it's kind of like trying to decide if having a "strongly typed" programming paradigm is better than a "weakly/dynamically typed" programming paradigm. It will depend on what you need to do.
You are doing the right thing - identity field should be numeric and not string based, both for space saving and for performance reasons (matching keys on strings is slower than matching on integers).
From the software perspective the GUID is better as its unique globally.
Quotes from: Primary Keys: IDs versus GUIDs
Using a GUID as a row identity value feels more natural-- and
certainly more truly unique-- than a 32-bit integer. Database guru Joe
Celko seems to agree. GUID primary keys are a natural fit for many
development scenarios, such as replication, or when you need to
generate primary keys outside the database. But it's still a question
of balancing the tradeoffs between traditional 4-byte integer IDs and
16-byte GUIDs:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if
you're not careful
Cumbersome to debug where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}'
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of
clustered indexes
My recommendation is: use ids.
You'll be able to rename that "Genre" with 20000 songs without breaking anything.
The idea behind this is that the id identifies the row in the table. Whatever the row has is something that doesn't matters in this problem.
This is in large part a matter of personal preference.
My personal opinion and practice is to always use integer keys and to always use surrogate rather than natural keys (so never use anything like social security number or the genre name directly).
There are cases where an auto number field is not appropriate or does not scale. In these cases it can make sense to use a GUID, which can be a string in databases that do not have a native datatype for it.
I'm fairly well versed in SQL server performace but I constanly have to argue down the idea that GUIDs should be used as the default type for Clusterd Primary Keys.
Assuming that the table has a fairly low amount of inserts per day (5000 +/- rows / day), what kind of performace issues could we run into? How will page splits affect our seek performance? How often should I reindex (or should I defrag)? What should I set the fill factors to (100, 90, 80, ect)?
What if I were inserting 1,000,000 rows per day?
I apologize beforhand for all of the questions, but i'm looking to get some backup for not using GUIDs as our default for PKs. I am however completely open to having my mind changed by the overwehlming knowledge from the StackOverflow user base.
If you are doing any kind of volume, GUIDs are extremely bad as a PK bad unless you use sequential GUIDs, for the exact reasons you describe. Page fragmentation is severe:
Average Average
Fragmentation Fragment Fragment Page Average
Type in Percent Count Size Count Space Used
id 4.35 7 16.43 115 99.89
newidguid 98.77 162 1 162 70.90
newsequentualid 4.35 7 16.43 115 99.89
And as this comparison between GUIDs and integers shows:
Test1 caused a tremendous amount of page splits, and had a scan density around 12% when I ran a DBCC SHOWCONTIG after the inserts had completed. The Test2 table had a scan density around 98%
If your volume is very low, however, it just doesn't matter that much.
If you do really need a globally unique ID but have high volume (and can't use sequential IDs), just put the GUIDs in an indexed column.
Drawbacks of using GUID as primary key:
No meaningful ordering, means indexing doesn't give performance boost as it does with an integer.
Size of a GUID 16 bytes, versus 2, 4 or 8 bytes for an integer.
Very difficult for humans to remember, so no good as a reference id.
Advantages:
Allow non-guessable primary keys that can therefore be less dangerous when displayed in a web page query string or in the application.
Useful in Databases that don't provide an auto increment or identity data type.
Useful when you need to join data between two disparate data sources across platforms or environments.
I thought the decision as to whether to use GUIDs was pretty simple, but maybe I'm unaware of other issues.
With such a low inserts per day, I doubt that page splitting should be a significant factor. The real question is how does 5,000 compares with the existing row count, as this would be the main information needed to decide on an appropriate initial fill factor to deffer splits.
This said, I'm personally not a big fan of GUIDs. I understand that they can serve well in some contexts but in many cases they are just "in the way" [of efficiency, of ease of use, of ...]
I find the following questions useful to narrow down on deciding whether GUID should be used or not.
Will the PK be shared/published ? (i.e. will it be used beyond its internal use within SQL, will applications need these keys in a somewhat persistent fashion? Will
users somehow see these keys?
Could the PK be used to help merge disparate data sources ?
Does the table have a primary -possibly composite- made from column(s) in the data ? What is the size of this possible this key
How do the primary keys sort? If composite, are the first few columns selective ?
Using a guid (unless it is a sequential GUID) as a clustered index is going to kill insert performance. Since the physical table layout is aligned according to the clustered index, using a guid which has a random sequencing order will cause serious table fragmentation. If you want to use a guid as a PK/Clustered index it must be a sequential guid using the newsequentialid() function in sql server. This will guarantee that the generated guids are ordered sequentially and prevent fragmentation.
We are trying to come up with a numbering system for the asset system that we are creating, there has been a few heated discussions on this topic in the office so I decided to ask the experts of SO.
Considering the database design below what would be the better option.
Example 1: Using auto surrogate keys.
================= ==================
Road_Number(PK) Segment_Number(PK)
================= ==================
1 1
Example 2: Using program generated PK
================= ==================
Road_Number(PK) Segment_Number(PK)
================= ==================
"RD00000001WCK" "00000001.1"
(the 00000001.1 means it's the first segment of the road. This increases everytime you add a new segment e.g. 00000001.2)
Example 3: Using a bit of both(adding a new column)
======================= ==========================
ID(PK) Road_Number(UK) ID(PK) Segment_Number(UK)
======================= ==========================
1 "RD00000001WCK" 1 "00000001.1"
Just a bit of background information, we will be using the Road Number and Segment Number in reports and other documents, so they have to be unique.
I have always liked keeping things simple so I prefer example 1, but I have been reading that you should not expose your primary keys in reports/documents. So now I'm thinking more along the lines of example 3.
I am also leaning towards example 3 because if we decide to change how our asset numbering is generated it won't have to do cascade updates on a primary key.
What do you think we should do?
Thanks.
EDIT: Thanks everyone for the great answers, has help me a lot.
This is really a discussion about surrogate (also called technical or synthetic) vs natural primary keys, a subject that has been extensively covered. I covered this in Database Development Mistakes Made by AppDevelopers.
Natural keys are keys based on
externally meaningful data that is
(ostensibly) unique. Common examples
are product codes, two-letter state
codes (US), social security numbers
and so on. Surrogate or technical
primary keys are those that have
absolutely no meaning outside the
system. They are invented purely for
identifying the entity and are
typically auto-incrementing fields
(SQL Server, MySQL, others) or
sequences (most notably Oracle).
In my opinion you should always
use surrogate keys. This issue has
come up in these questions:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Which format of primary key would you use in this situation.
Surrogate Vs. Natural/Business Keys
Should I have a dedicated primary key field?
Auto number fields are the way to go. If your keys have meaning outside your database (like asset numbers) those will quite possibly change and changing keys is problematic. Just use indexes for those things into the relevant tables.
I would personally say keep it simple and stay with an autoincremented primary key. If you need something more "Readable" in terms of display in the program, then possibly one of your other ideas, but I think that is just adding unneeded complexity to the primary key field.
I'm also very strongly in the "don't use primary keys as meaningful data" camp. Every time I have contravened that policy it has ended in tears. Sooner or later the meaningful data needs to change and if that means you have to change a primary key it can get painful. The primary key will probably be used in foreign key constraints and you can spend ages trying to sort it all out just to make a simple data change.
I always use GUIDs/UUIDs for my primary keys in every table I ever create but that's just personal preference serials or such are also good.
Don't put meaning into your PK fields unless...
It is 100% completely impossible that
the value will never change and that
No two people would ever reasonably
argue about which value should be
used for a particular row.
Go with option one and format the value in the app to look like option two or three when it is displayed.
I think the important thing to remember here is that each table in your database/design might have multiple keys. These are the Candidate Keys.
See wikipedia entry for Candidate Keys
By definition, all Candidate Keys are created equal. They are each unique identifiers for the table in question.
Your job then is to select the best candidate from the pool of Candidate Keys to serve as the Primary Key. The Primary Key will be used by other tables to establish the relational constraints, but you are free to continue using Candidate Keys to query the table.
Because Primary Keys are referenced by other structures, and therefore used in join operations, the criteria for Primary Key selection boils down to the following for me (in order of importance):
Immutable/Stable - Primary Key values should not change. If they do, you run the risk of introducing update anomolies
Not Null - most DBMS platforms require that the Primary Key attribute(s) are not null
Simple - simple datatypes and values for physical storage and performance. Integer values work well here, and this is the datatype of choice for most surrogate/auto-gen keys
Once you've identified the Candidate Keys, the criteria above can be used to select the Primary Key. If there is not a "Natural" Candidate Key meets the criteria, then a Surrogate Key that does meet the criteria can be created and used as mentioned in other answers.
Follow the Don't Use policy.
Some problems you can run into:
You need to generate keys from more than one host.
Someone will want to reserve contiguous numbers to use together.
How meaningful will people want it to be? Wars are fought over this, and you're in the first skirmish of one already. "It's already meaningful, and if we just add two more digits we can ..." i.e. you're establishing a design style that will (should) be extensible.
If you are concatenating the two, you're doing typecasts which can mess up your query Optimizer.
You'll need to reclassify roads, and redefine their boundaries (i.e. move the roads), which implies changing the primary key and maybe losing links.
There are workarounds for all this, but this is the kind of issue where workarounds proliferate and get out of control. And it doesn't take more than a couple to get beyond "Simple".
As mentioned before, keep your internal primary keys as just keys, whatever the most optimal datatype is on your platform.
However you do need to let the numbering system argument be fought out, as this is actually a business requirement, and perhaps let's call it an identification system for the asset.
If there is only going to be one identifier, then add it as a column to the main table. If there are likely to be many identification systems (and assets usually have many), you'll need two more tables
Identifier-type table Identifier-cross-ref table
type-id ------------> type-id (unique
type-name identifier-string key)
internal-id
That way different people who need to access the asset can identify in their own way. For example the server team will identify a server differently from the network team and different again from project management, accounts, etc.
Plus, you get to go to all the meetings where everyone argues with each other.
Another thing to keep in mind is that if you're importing alot of data into this system, you may find out that things like Road_Number are not as unique as you thought, and there may be operational roadblocks to fixing the problem (repainting road signs, etc.) .
While natural keys may have great meaning to the business users, if you do not have the agreement that those keys are sacred and should not be altered, you will more than likely be pulling your hair out while maintaining a database where the "product codes have to be changed to accommodate the new product line the company acquired." You need to protect the RI of your data, and integers as primary keys with auto-increment are the best way to go. Performance is also better when indexing and traversing integers than char columns.
While not appropriate as primary keys, natural keys are very appropriate for user consumption and you can enforce uniques via an index. They bring a context to the data that will make it easier for all parties to understand. Also, in the advent that you need to reload data, the natural keys can help verify that your lookups are still valid.
I would go with the surrogate key, but you may want to have a computed column that "formats" the surrogate key into a more "readable" value if that improves your reporting. The computed colum could produce example 2 from the surrogate key for instance for display purposes.
I think the surrogate key route is the way to go and the only exceptions that I make for it are join tables, where the primary key could be composed of the foreign key references. Even in these cases I'm finding that having a surrogate primary key is more useful than not.
I suspect that you really should use option #3, as many here have already said. Surrogate PKs (either Integers or GUIDs) are good practice, even if there are adequate business keys. Surrogates will reduce maintenance headaches (as you yourself have already noted).
That being said, something you may want to consider is whether or not your database is:
focused on data maintenance and transactional processing (i.e. Create/Update/Delete operations)
geared towards analysis and reporting (i.e. Queries)
In other words, are the users concerned with maintaining active data or querying largely static data to find answers?
If you are heavily focused on building an analysis and reporting DB (e.g. a data warehouse/mart) that is exposed to technical business users (e.g. report designers) who have a good grasp of the business vocabulary, then you might want to consider using natural keys based on meaningful business values. They help reduce query complexity by eliminating the need for complex joins and help the user focus on their task, not fighting the database structure.
Otherwise you're probably focused on a full CRUD DB that has to cover all the bases to some degree - this is the vast majority of situations. In which case, go with your option #3. You can always optimize for queryability in the future but you'll be hard pressed to retrofit for maintainability.
I hope you will agree with me that every design element should have single purpose.
Question is what do you think is purpose of PK? If it is to identify unique record in a table, then surrogate keys wins without much trouble. This is simple and straight.
As far as new columns in option 3 are concerned, you should check if these can be calculated (best would be to do calculation in model layer so that they can be changed easily than if calculation done in RDBMS) without too much of performance penalty from other elements. For example, you can store segment number and road number in corresponding tables and then use them to generate "00000001.1". This will allow to change asset numbering on-the-fly.
First off, option 2 is the absolute worst option. As an Index, it's a string, and that makes it slow. And it's generated based on business rules - which can change and cause a rather large headache.
Personally, I always use a separate primary key column; and I always use a GUID. Some developers prefer a simple INT over a GUID for reasons of hard-drive space. However, if the situation arises where you need to merge two databases, GUIDs will almost never collide (whereas INTs are guaranteed to collide).
Primary Keys should NEVER be seen by the user. Making it readable to the user should not be a concern. Primary Keys SHOULD be used to link with Foreign Keys. This is their purpose. The value should be machine readable and, once created, never changed.
I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.
Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?
Advantages:
Can generate them offline.
Makes replication trivial (as opposed to int's, which makes it REALLY hard)
ORM's usually like them
Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.
Disadvantages:
Larger space use, but space is cheap(er)
Can't order by ID to get the insert order.
Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
Harder to do manual debugging, but not that hard.
Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.
I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:
unique ID for the row (GUID/whatever). Never visible to the user.
public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)
UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.
If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).
So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).
Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(
#Matt Sheppard:
Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).
So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.
Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.
Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashion regarding that general purpose programming languages and (especially) middle tier frameworks changes all the time, and will generally never out-live your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).
In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.
The main advantages are that you can create unique id's without connecting to the database. And id's are globally unique so you can easilly combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.
The main disadvantages are a bit more storage needed (not a problem on modern systems) and the id's are not really human readable. This can be a problem when debugging.
There are some performance problems like index fragmentation. But those are easilly solvable (comb guids by jimmy nillson: http://www.informit.com/articles/article.aspx?p=25862 )
Edit merged my two answers to this question
#Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said it is easilly solved by adding meaningfull unique constraints to non-key columns. The alternative is to use a natural key and those have real problems..
GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).
One other small issue to consider with using GUIDS as primary keys if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because of the nature of a guid not begin sequential in anyway, thus their will be page splits, etc when you insert. Just something to consider if the system is going to have high IO...
primary-keys-ids-versus-guids
The Cost of GUIDs as Primary Keys (SQL Server 2000)
Myths, GUID vs. Autoincrement (MySQL 5)
This is realy what you want.
UUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
There is one thing that is not really addressed, namely using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.
RDBMs usually ensure the uniqueness of the primary keys, and ensure the lookups by a key, in a structure called BTree, which is a search tree with a large branching factor (a binary search tree has branching factor of 2). Now, a sequential integer ID would cause the inserts to occur just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.
Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.
Advantages:
UUID values are unique between tables and databases. Thats why it can be merge rows between two databases or distributed databases.
UUID is more safer to pass through url than integer type data.
If one pass UUID through url, attackers can't guess the next id.But if we pass Integer type such as 10, then attackers can guess the next id is 11 then 12 etc.
UUID can generate offline.
One thing not mentioned so far: UUIDs make it much harder to profile data
For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both
provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)
provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)
From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. this matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.
For example, I am a company that offers customers a permissions gated portal. the address is portal.com/profile/{customerId}. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying for lastKnownCustomerCount + 1 regularly, and checking if the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).
UUIDs non-sequential nature mitigate these issues. This isn't a garunted to prevent profiling, but it's a start.