I've been learning ASP.NET and using its membership system. When it auto-generated the tables, I was quite surprised to see that it uses a field type called 'uniqueidentifier' as a primary key, when for many years I have been using an integer field set to be an auto-incrementing identity.
What is the difference (if any at all) between these two methods, and why does .NET appear to favour the unique identifier field?
Thanks for any info!
Tom
The uniqueidentifier type is SQL Server's GUID type (the corresponding BCL type is System.Guid). Conceptually, a GUID is a 128-bit number, usually randomly generated, that is supposed to be unique.
While GUIDs have their detractors (comparing GUIDs is, strictly speaking, slightly slower than comparing ints), their random nature makes them helpful in environments like replication, where using an incrementing key can be difficult.
I'd say that .NET doesn't favour the uniqueidentifier or guid as an id, but this particular implementation (the ASP.NET SQL Server membership provider) does. I suspect that those who developed the database were working with the assumption that the db usage wasn't to be for high traffic sites, or where heavy reporting was likely to be done.
Perhaps they were trying to avoid any problems with integrating into an existing application, or a future scenario where your application already has its own key for a user. This could be any kind of key for any entity (PK, UserNumber, etc.). With GUID keys in the ASP.NET SQL Server implementation, the likelihood of a collision with such keys is very low, approaching zero.
The one drawback that I've learned is that having a clustered index on a guid doesn't scale to large volume databases.
I'm largely in the integer-as-PK camp. Integers are small (only 4 bytes) and work very well when your database needs to scale.
What is the difference (if any at all) between these two methods
For one, a uniqueidentifier is 16 bytes while an int is 4 bytes. If you have a URL like
http://bla.com?UserID=1
you can easily guess someone else's user id, so you can try 2 or 4, and so on.
When the UserID is C7478034-BB60-4F5A-BE51-72AAE5A96640, guessing is not as easy. Also, uniqueidentifiers are supposed to be unique across all computers.
If they use NEWID() instead of NEWSEQUENTIALID(), they will get fragmentation and page splits; take a look at Best Practice: Do not cluster on UniqueIdentifier when you use NewId.
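To put the size and guessability points in concrete terms, here is a small Python sketch. Python is used only for illustration; the bla.com URL is the placeholder from the answer above, and the variable names are made up.

```python
import struct
import uuid

# A 32-bit integer key packs into 4 bytes; a GUID/UUID is always 16 bytes.
int_key = struct.pack("<i", 1)
guid_key = uuid.uuid4().bytes

print(len(int_key), len(guid_key))  # prints: 4 16

# Neighbouring integer ids are trivial to enumerate ...
print([f"http://bla.com?UserID={n}" for n in (1, 2, 3)])

# ... whereas the neighbours of a randomly generated GUID are not obvious.
print(f"http://bla.com?UserID={uuid.uuid4()}")
```

Harder to guess is not the same as secure, though; an id in a URL still needs a proper authorization check behind it.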
Related
Which is "better": generating primary keys in the database or generating them in application code, specifically when using the GUID/uniqueidentifier data type for the keys?
I have read up on the difference between using GUID and int data types, and it sounds like GUIDs are feasible for so-called "offline generation".
E.g.
instead of having a NEWID() default constraint in the database, in one project (where we are using Entity Framework) we call Guid.NewGuid() in the application code to generate the PK when inserting data.
Is this a bad approach?
My concerns are:
Database indexes: database performance may suffer because the ids might not be sequential.
The one in 64 billion chance that the key is already used (considering that the application will not be enormous but may need room to grow).
Perhaps there are other disadvantages?
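On the collision concern, a rough way to size the risk is the birthday bound. The sketch below assumes the keys are version-4 (random) GUIDs, which carry 122 random bits; that assumption, and the function name, are mine rather than something stated in the question.

```python
import math

RANDOM_BITS = 122  # assumption: version-4 UUIDs, 6 of the 128 bits are fixed


def collision_probability(n_keys: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    exponent = n_keys * n_keys / (2 * 2 ** bits)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)


for n in (10**6, 10**9, 10**12):
    print(f"{n:>17,d} keys -> p ~ {collision_probability(n):.3e}")
```

Under that assumption, even a billion keys give a probability far below the one-in-64-billion figure mentioned above, so the index behaviour is likely the more practical concern.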
Actually, GUIDs can be sequential as of SQL Server 2005. There is a function named NEWSEQUENTIALID() (link here):
Creates a GUID that is greater than any GUID previously generated by
this function on a specified computer since Windows was started. After
restarting Windows, the GUID can start again from a lower range, but
is still globally unique. When a GUID column is used as a row
identifier, using NEWSEQUENTIALID can be faster than using the NEWID
function. This is because the NEWID function causes random activity
and uses fewer cached data pages. Using NEWSEQUENTIALID also helps to
completely fill the data and index pages.
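A toy model of why ever-increasing keys are kinder to an ordered index than random ones: count how many inserts land at the end of a sorted list. This is only an illustration in Python, not how SQL Server's B-tree pages actually work, and plain integer ordering stands in for the server's own uniqueidentifier ordering.

```python
import bisect
import uuid


def count_appends(keys):
    """Insert keys into a sorted list; count how many land at the very end."""
    ordered = []
    appends = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos == len(ordered):
            appends += 1
        ordered.insert(pos, key)
    return appends


N = 2000
sequential_keys = list(range(N))                    # stand-in for NEWSEQUENTIALID()
random_keys = [uuid.uuid4().int for _ in range(N)]  # stand-in for NEWID()

print("sequential appends:", count_appends(sequential_keys))
print("random appends:    ", count_appends(random_keys))
```

Every sequential key is an append, while nearly all random keys land somewhere in the middle, which is the "random activity" the quoted documentation refers to.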
I am thinking of using a GUID in my .NET app, which uses SQL Server. Should I write a stored procedure that generates the GUID for each record entered, or should I generate it directly in the application?
Reasons for asking the question (correct me if I am wrong):
I (as/pre)sume:
When generating the GUID in the database, you can assume that the DB remembers the previously generated GUID, whereas having the application remember it is difficult.
SQL Server has GUID creation built in. There is no need to write a separate stored procedure for this.
You can use
NEWID()
NEWSEQUENTIALID()
The key difference between the two functions is that the sequential GUID should be used if it is for a clustered primary key.
I'm not sure why you would want the database engine to remember the previously generated GUID.
No, your assumption is wrong: the database won't be remembering anything - so there's no benefit from that point of view.
If you're using the GUID as your primary key / clustering key in SQL Server, which is a bad idea to begin with (see here, here or here why that's the case), you should at least use the newsequentialid() function as default constraint on that column.
CREATE TABLE YourTable(ColumnA uniqueidentifier DEFAULT NEWSEQUENTIALID())
That way, the database would generate pseudo-sequential GUIDs for your PK and thus would make the negative effects of using a GUID as PK/CK at least bearable.
If you're not using the GUID as your primary key, then I don't see any benefit in creating that GUID on the server, really.
My preference is to create the GUID in the application, not the db.
Simplifies the retrieval of rows after insertion.
Easier domain/business layer unit testing.
Faster. At least for entity framework. Link
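A minimal sketch of the "simplifies retrieval after insertion" point, using Python's built-in sqlite3 as a stand-in for SQL Server. The table and column names are hypothetical, and the GUIDs are stored as text purely for readability.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE purchase ("
    " id TEXT PRIMARY KEY,"
    " customer_id TEXT NOT NULL REFERENCES customer(id),"
    " item TEXT NOT NULL)"
)

# Keys are created in application code, before anything touches the database.
customer_id = str(uuid.uuid4())
purchase_id = str(uuid.uuid4())

# Parent and child go in together; no round trip to read back a generated id.
with conn:
    conn.execute(
        "INSERT INTO customer (id, name) VALUES (?, ?)", (customer_id, "example")
    )
    conn.execute(
        "INSERT INTO purchase (id, customer_id, item) VALUES (?, ?, ?)",
        (purchase_id, customer_id, "album"),
    )

row = conn.execute(
    "SELECT c.name, p.item FROM purchase p"
    " JOIN customer c ON c.id = p.customer_id WHERE p.id = ?",
    (purchase_id,),
).fetchone()
print(row)  # prints: ('example', 'album')
```

Because the ids exist before the insert, a unit test can also build the whole object graph in memory and assert on the keys without a database.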
RFC 4122: «Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example. A predictable random number source will exacerbate the situation».
For simple tasks, an incrementing uint64 is better.
Do NOT use a GUID if you need security!
http://social.msdn.microsoft.com/Forums/en/netfxbcl/thread/b37b3438-90f4-41fb-adb9-3ddba16fe07c
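One way to see why the RFC warns against treating UUIDs as secrets: a version-1 UUID carries its creation time in plain sight. The Python sketch below decodes it, then shows a token from a cryptographic random source as an alternative for cases where possession alone grants access. The helper name is made up for illustration.

```python
import secrets
import time
import uuid

# Number of 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid1_unix_time(u: uuid.UUID) -> float:
    """Recover the creation time (Unix seconds) embedded in a version-1 UUID."""
    return (u.time - UUID_EPOCH_OFFSET) / 1e7


u = uuid.uuid1()
print("embedded timestamp:", uuid1_unix_time(u), "now:", time.time())

# For a value whose mere possession grants access, use a CSPRNG token instead.
token = secrets.token_urlsafe(32)
print("access token:", token)
```

Random (version-4) GUIDs do not leak a timestamp, but their unpredictability still depends on the generator, which is exactly the caveat in the quote.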
I am currently planning to develop a music streaming application, and I am wondering what would be better as a primary key in my tables on the server: an int ID or a unique string.
Method 1:
Songs table:
SongID(int, PK), Title(string), Artist(string, FK), Length(int), Album(string, FK)
Genre table:
Genre(string, PK), Name(string)
SongGenre table:
SongID(int, PK, FK), Genre(string, PK, FK)
Method 2:
Songs table:
SongID(int, PK), Title(string), ArtistID(int, FK), Length(int), AlbumID(int, FK)
Genre table:
GenreID(int, PK), Name(string)
SongGenre table:
SongID(int, PK, FK), GenreID(int, PK, FK)
Key: PK = Primary Key, FK = Foreign Key
I'm currently designing using method 2, as I believe it will speed up lookup performance and use less space, since an int takes a lot less space than a string.
Is there any reason this isn't a good idea? Is there anything I should be aware of?
Yes. Integer IDs are very bad if you need to uniquely identify the same data outside of a single database. For example, if you have to copy the same data into another database system with potentially pre-existing data or you have a distributed database. The biggest thing to be aware of is that an integer like 7481 has no meaning outside of that one database. If later on you need to grow that database, it may be impossible without surgically removing your data.
The other thing to keep in mind is that integer IDs aren't as flexible so they can't easily be used for exceptional cases. The designers of the Internet Protocol understood this and took precautions by allocating certain blocks of numbers as "special" in one way or another (broadcast IPs, private IPs, network IPs). But that was only possible because there's a protocol surrounding the usage of those numbers. Many databases don't operate within such a well-defined protocol.
FWIW, it's kind of like trying to decide if having a "strongly typed" programming paradigm is better than a "weakly/dynamically typed" programming paradigm. It will depend on what you need to do.
You are doing the right thing - identity field should be numeric and not string based, both for space saving and for performance reasons (matching keys on strings is slower than matching on integers).
From the software perspective, the GUID is better as it's globally unique.
Quotes from: Primary Keys: IDs versus GUIDs
Using a GUID as a row identity value feels more natural-- and
certainly more truly unique-- than a 32-bit integer. Database guru Joe
Celko seems to agree. GUID primary keys are a natural fit for many
development scenarios, such as replication, or when you need to
generate primary keys outside the database. But it's still a question
of balancing the tradeoffs between traditional 4-byte integer IDs and
16-byte GUIDs:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
My recommendation is: use ids.
You'll be able to rename that "Genre" with 20000 songs without breaking anything.
The idea behind this is that the id identifies the row in the table. Whatever else the row contains doesn't matter for this problem.
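The rename scenario can be sketched with Python's built-in sqlite3. The table layout loosely follows Method 2 from the question, but the snake_case names and the genre values are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE song_genre (
        song_id  INTEGER NOT NULL,
        genre_id INTEGER NOT NULL REFERENCES genre(genre_id),
        PRIMARY KEY (song_id, genre_id)
    );
    INSERT INTO genre (genre_id, name) VALUES (1, 'Hip Hop');
    """
)

# 20000 songs all point at genre 1 by id, not by name.
conn.executemany(
    "INSERT INTO song_genre (song_id, genre_id) VALUES (?, 1)",
    [(song_id,) for song_id in range(1, 20001)],
)

# Renaming the genre touches exactly one row; song_genre is left alone.
rows_changed = conn.execute(
    "UPDATE genre SET name = 'Hip-Hop' WHERE genre_id = 1"
).rowcount

renamed = conn.execute(
    "SELECT COUNT(*) FROM song_genre sg"
    " JOIN genre g ON g.genre_id = sg.genre_id WHERE g.name = 'Hip-Hop'"
).fetchone()[0]
print(rows_changed, renamed)  # prints: 1 20000
```

With the genre name itself as the key (Method 1), the same rename would have to rewrite all 20000 referencing rows.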
This is in large part a matter of personal preference.
My personal opinion and practice is to always use integer keys and to always use surrogate rather than natural keys (so never use anything like social security number or the genre name directly).
There are cases where an auto number field is not appropriate or does not scale. In these cases it can make sense to use a GUID, which can be a string in databases that do not have a native datatype for it.
For SQL Server, is it better to use a uniqueidentifier (GUID) or a bigint for an identity column?
That depends on what you're doing:
If speed is the primary concern then a plain old int is probably big enough.
If you really will have more than 2 billion (with a B ;) ) records, then use bigint or a sequential guid.
If you need to be able to easily synchronize with records created remotely, then Guid is really great.
Update
Some additional (less-obvious) notes on Guids:
They can be hard on indexes, and that cuts to the core of database performance
You can use sequential guids to get back some of the indexing performance, but give up some of the randomness used in point two.
Guids can be hard to debug by hand (where id='xxx-xxx-xxxxx'), but you get some of that back via sequential guids as well (where id='xxx-xxx' + '123').
For the same reason, Guids can make ID-based security attacks more difficult, but not impossible. (You can't just type 'http://example.com?userid=xxxx' and expect to get a result for someone else's account).
In general I'd recommend a BIGINT over a GUID (as guids are big and slow), but the question is, do you even need that? (I.e. are you doing replication?)
If you're expecting fewer than 2 billion rows, the traditional INT will be fine.
If you are doing replication, or you have salespeople who run disconnected databases that need to merge, use a GUID. Otherwise I'd go for an int or bigint. They are far easier to deal with in the long run.
It depends on what you need. DB performance would gain from integers, while GUIDs are useful for replication and for not needing to hear back from the DB which identity was created, i.e. code can create the GUID identity before inserting the row.
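For reference, the "2 billion" limit is the maximum of a signed 32-bit integer. A quick Python check (the helper name is made up, and it assumes an identity that starts at 1 and never reuses values):

```python
INT_MAX = 2**31 - 1     # signed 32-bit INT
BIGINT_MAX = 2**63 - 1  # signed 64-bit BIGINT

print(f"INT    max: {INT_MAX:,}")     # 2,147,483,647
print(f"BIGINT max: {BIGINT_MAX:,}")  # 9,223,372,036,854,775,807


def fits_in_int(expected_rows: int) -> bool:
    """True if an identity starting at 1 can number this many rows with INT."""
    return expected_rows <= INT_MAX
```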
If you're planning on using merge replication then a ROWGUIDCOL is beneficial to performance (see here for info). Otherwise we need more info about what your definition of 'better' is; better for what?
Unless you have a real need for a GUID, such as being able to generate keys anywhere and not just on the server, then I would stick with using INTEGER-based keys. GUIDs are expensive to create and make it harder to actually look at the data. Plus, have you ever tried to type a GUID in an SQL query? It's painful!
There can be a few more aspects or requirements that favour a GUID.
If the primary key is of a numeric type (int, bigint, or any other), then you either need to make it an identity column or check the last saved value in the table.
In that case, if the record in the foreign table is saved in the same transaction, it is difficult to get the last identity value of the primary key. If something like IDENT_CURRENT is used, it again affects performance while saving the foreign-key record.
So when saving records in a transaction, it is convenient to first generate a GUID for the primary key and then save that generated key in the primary and foreign table(s).
It really depends on whether the incoming information is somehow sequential. For things such as users, I'd recommend a GUID. But for sequential data, such as orders or other things that need to be easily sortable, a bigint may well be a better solution, as it will be indexed and provide fast sorting without the cost of another index.
It really depends whether you're expecting to have replication in the picture. Replication requires a row UUID, so if you're planning on doing that you may as well do it up front.
I'm with Andrew Rollings.
Now you could argue space efficiency. An int is what, 8 bytes max? A GUID (16 bytes) is going to be much longer.
But I have two main reasons for preference: readability and access time. Numbers are easier for me than GUIDs (since I can always find the next/previous record easily).
As for access time, note that some DBs can start to have BIG problems with GUIDs. I know this is the case with MySQL (MySQL InnoDB Primary Key Choice: GUID/UUID vs Integer Insert Performance). This may not be much of a problem with SQL Server, but it's something to watch out for.
I'd say stick with INT or BIGINT. The only time I would think you'd want the GUID is when you are going to give them out and don't want people to be able to guess the IDs of other records for security reasons.
I'm currently working on someone else's database where the primary keys are generated via a lookup table which contains a list of table names and the last primary key used. A stored procedure increments this value and checks it is unique before returning it to the calling 'insert' SP.
What are the benefits for using a method like this (or just generating a GUID) instead of just using the Identity/Auto-number?
I'm not talking about primary keys that actually 'mean' something like ISBNs or product codes, just the unique identifiers.
Thanks.
An auto generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.
If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
There's nothing inherently wrong with using AutoNumber, but there are a few reasons not to do it. Still, rolling your own solution isn't the best idea, as dacracot mentioned. Let me explain.
The first reason not to use AutoNumber on each table is you may end up merging records from multiple tables. Say you have a Sales Order table and some other kind of order table, and you decide to pull out some common data and use multiple table inheritance. It's nice to have primary keys that are globally unique. This is similar to what bobwienholt said about merging databases, but it can happen within a database.
Second, other databases don't use this paradigm, and other paradigms such as Oracle's sequences are way better. Fortunately, it's possible to mimic Oracle sequences using SQL Server. One way to do this is to create a single AutoNumber table for your entire database, called MainSequence, or whatever. No other table in the database will use autonumber, but anyone that needs a primary key generated automatically will use MainSequence to get it. This way, you get all of the built in performance, locking, thread-safety, etc. that dacracot was talking about without having to build it yourself.
Another option is using GUIDs for primary keys, but I don't recommend that because even if you are sure a human (even a developer) is never going to read them, someone probably will, and it's hard. And more importantly, things implicitly cast to ints very easily in T-SQL but can have a lot of trouble implicitly casting to a GUID. Basically, they are inconvenient.
In building a new system, I'd recommend using a dedicated table for primary key generation (just like Oracle sequences). For an existing database, I wouldn't go out of my way to change it.
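A rough sketch of the single-sequence-table idea, using Python's built-in sqlite3 as a stand-in for SQL Server. The answer calls the table MainSequence; the snake_case names, the two order tables and the next_id helper here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE main_sequence (id INTEGER PRIMARY KEY AUTOINCREMENT);
    CREATE TABLE sales_order (id INTEGER PRIMARY KEY, note TEXT);
    CREATE TABLE service_order (id INTEGER PRIMARY KEY, note TEXT);
    """
)


def next_id(conn: sqlite3.Connection) -> int:
    """Take the next value from the single shared sequence table."""
    cur = conn.execute("INSERT INTO main_sequence DEFAULT VALUES")
    return cur.lastrowid


with conn:
    sales_id = next_id(conn)
    conn.execute(
        "INSERT INTO sales_order (id, note) VALUES (?, ?)", (sales_id, "widgets")
    )
    service_id = next_id(conn)
    conn.execute(
        "INSERT INTO service_order (id, note) VALUES (?, ?)", (service_id, "repair")
    )

print(sales_id, service_id)  # prints: 1 2
```

Because both tables draw from the same sequence, their keys never overlap, which is what makes a later merge into a common parent table painless.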
from CodingHorror:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
The article provides a lot of good external links on making the decision on GUID vs. Auto Increment. If I can, I go with GUID.
It's useful for clients to be able to pre-allocate a whole bunch of IDs to do a bulk insert without having to then update their local objects with the inserted IDs. Then there's the whole replication issue, as mentioned by Galwegian.
The procedure method of incrementing must be thread safe. If not, you may not get unique numbers. Also, it must be fast, otherwise it will be an application bottleneck. The built in functions have already taken these two factors into account.
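To illustrate the thread-safety requirement, here is a small Python sketch of a hand-rolled "last key used" counter guarded by a lock. The class and function names are made up; a real stored procedure would rely on database locking rather than an in-process lock.

```python
import threading


class KeyCounter:
    """Hand-rolled 'last key used' counter guarded by a lock."""

    def __init__(self, start: int = 0) -> None:
        self._last = start
        self._lock = threading.Lock()

    def next_key(self) -> int:
        # Without the lock, two threads could read the same value and
        # both hand out the same "unique" key.
        with self._lock:
            self._last += 1
            return self._last


def hammer(counter: KeyCounter, n_threads: int = 8, per_thread: int = 1000) -> list:
    """Request keys from several threads at once and collect them all."""
    results = []
    results_lock = threading.Lock()

    def worker() -> None:
        local = [counter.next_key() for _ in range(per_thread)]
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


keys = hammer(KeyCounter())
print(len(keys), len(set(keys)))  # prints: 8000 8000
```

The lock is also the bottleneck the answer warns about: every caller queues behind it, which is why the built-in identity mechanisms are usually the better choice.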
My main issue with auto-incrementing keys is that they lack any meaning
That's a requirement of a primary key, in my mind -- to have no other reason to exist other than identifying a record. If it has no real-world meaning, then it has no real-world reason to change. You don't want primary keys to change, generally speaking, because you have to search-replace your whole database or worse. I have been surprised at the sorts of things I have assumed would be unique and unchanging that have not turned out to be years later.
Here's the thing with auto incrementing integers as keys:
You HAVE to have posted the record before you get access to it. That means that until you have posted the record, you cannot, for example, prepare related records that will be stored in another table, or any one of a lot of other possible reasons why it might be useful to have access to the new record's unique id, before posting it.
The above is my deciding factor, whether to go with one method, or the other.
Using unique identifiers would allow you to merge data from two different databases.
Maybe you have an application that collects data in multiple databases and then "syncs" with a master database at various times in the day. You wouldn't have to worry about primary key collisions in this scenario.
Or, possibly, you might want to know what a record's ID will be before you actually create it.
One benefit is that it can allow the database/SQL to be more cross-platform. The SQL can be exactly the same on SQL Server, Oracle, etc...
The only reason I can think of is that the code was written before sequences were invented and the code forgot to catch up ;)
I would prefer to use a GUID for most of the scenarios in which the post's current method makes any sense to me (replication being a possible one). If replication was the issue, such a stored procedure would have to be aware of the other server which would have to be linked to ensure key uniqueness, which would make it very brittle and probably a poor way of doing this.
One situation where I use integer primary keys that are NOT auto-incrementing identities is the case of rarely-changed lookup tables that enforce foreign key constraints, that will have a corresponding enum in the data-consuming application. In that scenario, I want to ensure the enum mapping will be correct between development and deployment, especially if there will be multiple prod servers.
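A sketch of that lookup-table pattern in Python with the built-in sqlite3 module. The OrderStatus enum and the order_status table are hypothetical examples, not taken from any particular application.

```python
import enum
import sqlite3


class OrderStatus(enum.IntEnum):
    # These values are the primary keys of the lookup table; they are
    # assigned explicitly and never auto-generated.
    PENDING = 1
    SHIPPED = 2
    CANCELLED = 3


def seed_lookup(conn: sqlite3.Connection) -> None:
    """Create the lookup table and fill it from the enum definition."""
    conn.execute(
        "CREATE TABLE order_status (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    conn.executemany(
        "INSERT INTO order_status (id, name) VALUES (?, ?)",
        [(status.value, status.name) for status in OrderStatus],
    )


def lookup_matches_enum(conn: sqlite3.Connection) -> bool:
    """Check that the table and the enum agree, e.g. as a deployment test."""
    rows = dict(conn.execute("SELECT id, name FROM order_status"))
    return rows == {status.value: status.name for status in OrderStatus}


conn = sqlite3.connect(":memory:")
seed_lookup(conn)
print(lookup_matches_enum(conn))  # prints: True
```

With an auto-incrementing identity, the ids would depend on insertion order on each server, so the same enum could map to different rows in development and production.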
Another potential reason is that you deliberately want random keys. This can be desirable if, say, you don't want nosey browsers leafing through every item you have in the database, but it's not critical enough to warrant actual authentication security measures.
My main issue with auto-incrementing keys is that they lack any meaning.
For tables where certain fields provide uniqueness (whether alone or in combination with another), I'd opt for using that instead.
A useful side benefit of using a GUID primary key instead of an auto-incrementing one is that you can assign the PK value for a new row on the client side (in fact you have to do this in a replication scenario), sparing you the hassle of retrieving the PK of the row you just added on the server.
One of the downsides of a GUID PK is that joins on a GUID field are slower (unless this has changed recently). Another upside of using GUIDs is that it's fun to try and explain to a non-technical manager why a GUID collision is rather unlikely.
Galwegian's answer is not necessarily true.
With MySQL you can set a key offset for each database instance. If you combine this with a large enough increment, it will work fine. I'm sure other vendors would have some sort of similar settings.
Let's say we have 2 databases we want to replicate. We can set it up in the following way.
increment = 2
db1 - offset = 1
db2 - offset = 2
This means that
db1 will have keys 1, 3, 5, 7....
db2 will have keys 2, 4, 6, 8....
Therefore we will not have key clashes on inserts.
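The same arithmetic as a short Python sketch. In MySQL the corresponding settings are the auto_increment_increment and auto_increment_offset system variables; the key_stream helper below is only an illustration of the resulting sequences.

```python
from itertools import count, islice


def key_stream(offset: int, increment: int):
    """Yield offset, offset + increment, offset + 2 * increment, ..."""
    return count(start=offset, step=increment)


db1 = list(islice(key_stream(offset=1, increment=2), 4))
db2 = list(islice(key_stream(offset=2, increment=2), 4))

print("db1:", db1)  # db1: [1, 3, 5, 7]
print("db2:", db2)  # db2: [2, 4, 6, 8]
```

The increment has to be at least the number of instances that will ever write; adding a third database later means changing it everywhere.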
The only real reason to do this is to be database agnostic (if different db versions use different auto-numbering techniques).
The other issue mentioned here is the ability to create records in multiple places (like in the central office as well as on traveling users' laptops). In that case, though, you would probably need something like a "sitecode" that was unique to each install that was prefixed to each ID.