On choosing GUID vs identity INT for primary key…
Is it true that an insert of a GUID is going to take longer because the engine has to search for the right place in the index to put that GUID? And that it can cause page splits?
If this is true, maybe we should have used identity INT all along?
Thanks.
It is true that a newer GUID value can be less than a previously inserted GUID. And yes, enough of those can force the index to split pages. An INT or BIGINT auto-increment/identity column wouldn't run into that unless you inserted lower values manually with IDENTITY_INSERT ON.
If you can't change from a GUID for some reason, check out
NEWSEQUENTIALID()
http://msdn.microsoft.com/en-us/library/ms189786.aspx
It generates a GUID greater than any it has previously generated on that machine. The caveat is that the "greater than" guarantee only holds until the machine restarts.
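One thing worth showing: NEWSEQUENTIALID() cannot be called directly in an INSERT; it is only valid as a column DEFAULT. A minimal sketch (table and column names here are made up):

-- Sketch with hypothetical names: NEWSEQUENTIALID() is only valid as a DEFAULT
CREATE TABLE dbo.Orders
(
    OrderId UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Orders_OrderId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,
    OrderDate DATETIME NOT NULL
);

-- Each insert receives a GUID greater than the previous one (until a restart)
INSERT INTO dbo.Orders (OrderDate) VALUES (GETDATE());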
Brad
I think you mean CLUSTERED INDEX more so than PRIMARY KEY. Of course, ANY index will get fragmented rather quickly when using NEWID() as the values are not ever-increasing and hence cause page-splits.
I provided a rather detailed explanation of the effects of non-sequential INSERT operations here:
What is fragmenting my index on a table with stable traffic?
Yes, NEWSEQUENTIALID() is far better, at least in terms of GUIDs, in that it is sequential, BUT, it is not a guarantee against page splits. The issue is that the starting value is determined based on a value that gets reset when the server is restarted. Hence, after a restart, the new sequential range can start lower than the previous first record. The first INSERT at that point will cause a page split but from there should be fine.
Also, a UNIQUEIDENTIFIER is 16 bytes, compared to 4 for INT and even 8 for BIGINT. That increased size makes each row take up more space, so fewer rows fit on each 8 KB data page. Allocating new pages takes time, especially if you need to grow the data file to fit the new pages (and the larger the rows, the faster the data file fills). Hence, yes, you most likely should have gone with INT IDENTITY from the start.
In cases where an external and/or portable identifier is needed (this is where GUIDs are very handy), you should still start with an INT (or even BIGINT) IDENTITY field as the clustered PK and have another column hold the GUID, with a UNIQUE INDEX placed on it. This is known as an alternate key and works well in these situations.
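A minimal sketch of that pattern (all names are illustrative):

-- INT IDENTITY as the clustered PK; the GUID is the portable, external identifier
CREATE TABLE dbo.Customers
(
    CustomerId   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED,
    CustomerGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Customers_Guid DEFAULT NEWID(),
    CustomerName VARCHAR(100) NOT NULL
);

-- The alternate key: unique, but not the clustered index
CREATE UNIQUE NONCLUSTERED INDEX UQ_Customers_CustomerGuid
    ON dbo.Customers (CustomerGuid);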
Related
My question is almost the same as Changing newid() to newsequentialid() on an existing table.
If I change to newsequentialid(), then the index should be more compact, correct?
And what happens if the sequence hits existing IDs while inserting new records? Will the database check that beforehand?
It will have less fragmentation, so you could say it's more compact. But it will still be the same size per key (16 bytes + overhead) for a GUID key. The benefit of using a sequential GUID vs. a non-sequential GUID is that you have fewer chances for a page split. A page split is where a logical page has to have a record inserted but would then hold more than the page is allowed to, so the page is "split": half to one page and half to another. Sometimes a page split causes another page to split, and theoretically you can have a cascading and costly page split by inserting just one new record. When you use a sequential key, it's less likely that you'll randomly trigger a page split somewhere in the middle of your index, so you reduce the likelihood of those happening. Using a sequential GUID also helps optimize range scans (e.g. selecting between one value and another), but with a GUID it's very unlikely that you'll end up doing many range-based scans, since the value is basically meaningless.
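If you want to watch this happen, here is a sketch (substitute your own table name) that reports fragmentation, which climbs much faster under NEWID() than under a sequential key:

-- Sketch: inspect fragmentation of all indexes on a hypothetical table
SELECT i.name AS index_name,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id;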
What happens when the sequence hits an existing ID? You get a PK violation. SQL Server doesn't ensure that a GUID can only be used once. Sequential GUIDs start at a new seed every time the server is restarted, so, in theory, you could skip back in the sequence and wind up covering the same value twice. However, as with GUIDs in general, the likelihood of this happening is so astronomically small as to be statistically insignificant.
As with everything, the costs and benefits depend on your specific scenario. If you're looking to replace a GUID key with a sequential key, see if it's possible to use an INT or BIGINT surrogate key instead of a GUID, because generally, all things being equal, an integer will outperform a GUID in every case. 4 million records will trivially fit into an INT data type, and even more trivially into a BIGINT.
Hope this helps.
I was trying to create an ID column in SQL Server (via VB.NET) that would provide a sequence of numbers for every new row created in a database, so I used the following technique to create the ID column.
-- Check the current contents of the table
SELECT * FROM T_Users;

-- Add an auto-incrementing integer ID column as the primary key
ALTER TABLE T_Users
ADD User_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY;
Then I registered a few usernames into the database and it worked just fine. For example, the first six rows would be 1, 2, 3, 4, 5, 6. Then I registered 4 more users the next day, but this time the ID numbers jumped from 6 to a very large number: 1, 2, 3, 4, 5, 6, 1002, 1003, 1004, 1005. Then two days later I registered two more users, and the new rows read 3002, 3004. So my question is: why is it skipping such a large number every other day I register users? Is the technique I used to create the sequence wrong? If it is wrong, can anyone please tell me how to do it right?

As I was getting frustrated with the technique above, I alternatively tried sequentially generated GUID values. The sequence of GUID values was generated fine. However, the only downside is that it generates very long values (four times the size of an INT). My question here is: does using a GUID have any significant advantage over INT?
Regards,
Upside of GUIDs:
GUIDs are good if you ever want offline clients to be able to create new records, as you will never get a primary key clash when the new records are synchronised back to the main database.
Downside of GUIDs:
GUIDs as primary keys can have an effect on the performance of the DB, because for a clustered primary key, the DB will want to keep the rows in order of the key values. But this means a lot of inserts between existing records, because the GUIDs will be random.
Using an IDENTITY column doesn't suffer from this, because the next record is guaranteed to have the highest value, so the row is just tacked on the end every time. No reshuffle needs to happen.
There is a compromise, which is to generate a pseudo-GUID. It means you would expect a key clash every 70 years or so, but it helps the indexing immensely.
The other downsides are that a) they take up more storage space, and b) they are a real pain to write SQL against: it is much easier to type UPDATE TABLE SET FIELD = 'value' WHERE KEY = 50003 than UPDATE TABLE SET FIELD = 'value' WHERE KEY = '{F820094C-A2A2-49cb-BDA7-549543BB4B2C}'.
Your declaration of the IDENTITY column looks fine to me. The gaps in your key values are probably due to failed attempts to add a row. The IDENTITY value will be incremented but the row never gets committed. Don't let it bother you, it happens in practically every table.
EDIT:
This question covers what I meant by pseudo-GUID. INSERTs with sequential GUID key on clustered index not significantly faster
In SQL Server 2005+ you can use NEWSEQUENTIALID() to get a GUID that is greater than those previously generated on that machine. See here for more info: http://technet.microsoft.com/en-us/library/ms189786%28v=sql.90%29.aspx
Is the technique I used to create the sequence wrong?
No. If anything, your Google skills are nonexistent. A short search for "Sql server identity skipping values" will give you a ton of results, including:
SQL Server 2012 column identity increment jumping from 6 to 1000+ on 7th entry
and the canonical:
Why are there gaps in my IDENTITY column values?
You are basically assuming, wrongly, that SQL Server will not optimize its access for performance. Identity numbers are markers, nothing else; please don't assume they will have no gaps.
In particular: SQL Server preallocates numbers in blocks of 1,000, and if you restart the server (like on your workstation), the remainder of the block is lost.
http://www.sqlserver-training.com/sequence-breaks-gap-in-numbers-after-restart-sql-server-gap-between-numbers-after-restarting-server/
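As an aside (not available in 2012, but in SQL Server 2017 and later), you can also switch this preallocation off at the database level; a one-line sketch:

-- SQL Server 2017+ only: disable identity preallocation for the current database
ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;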
If you use a manual SEQUENCE instead (new in SQL Server 2012), you can define the cache size for this pregeneration and set it to 1, at the cost of slightly lower performance when you do a lot of inserts.
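A sketch of that approach (sequence and table names are hypothetical), using NO CACHE so that nothing is pregenerated and a restart cannot discard preallocated values:

-- SQL Server 2012+: a sequence with no pregenerated cache
CREATE SEQUENCE dbo.UserIdSeq AS INT
    START WITH 1
    INCREMENT BY 1
    NO CACHE;

-- Use the sequence as the column default instead of IDENTITY
CREATE TABLE dbo.Users
(
    User_ID  INT NOT NULL
        CONSTRAINT DF_Users_User_ID DEFAULT (NEXT VALUE FOR dbo.UserIdSeq)
        CONSTRAINT PK_Users PRIMARY KEY,
    UserName VARCHAR(50) NOT NULL
);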
My question here is does using GUID have any significant advantage over INT?
Yes. You can have a lot more rows with GUIDs than with INTs. For example, INT32 is limited to about 2 billion rows. For some of us that is too low (I have tables in the 10 billion range), and even a 64-bit BIGINT is limited. In a truly zettabyte-scale database, you have to use a self-generated, sequential GUID.
Most normal humans never see a difference, as we do not really deal with that many rows. And the larger size makes a lot of things slower (larger key size = larger space in indexes = larger indexes = more memory/IO for the same operation). Plus, even your sequential ID will jump.
Why not just adjust your expectation to reality - identity is not meant to be without gaps - or use a sequence with cache 1.
Are there performance (or other) issues using the Binary datatype for Primary Keys. The database has a large number of large tables that are regularly joined using these keys. The indexes are clustered. I believe that these can't be automatically incremented (as an Identity field).
In SQL Server the Primary Key is by default also the key for the clustered index.
The Primary Key itself only needs to be unique and not nullable. There are no other restrictions.
The clustered index key, however, should be as short as possible. In most cases an ever-increasing value is also preferred. The reason is that an index's depth is directly affected by the length of the index key. That is true for any index type. The clustered index key, however, gets automatically appended to every other index key on that table, therefore multiplying the negative effect of a long key. That means in most cases an INT IDENTITY is a good choice.
If your Primary Key is non-clustered, keeping it short is not that important. However, you are using it for joins. That means you probably have an index on this key on each child table too, therefore multiplying the problem again. So again, an automatically increasing surrogate key is probably the better choice.
This all is true for many if not most cases. However, there are always exceptions. You do not give a lot of information about your use case so the answer has to be general in nature. Make sure you test the performance of read as well as modification operations in your environment with realistic data before deciding which way to go.
As a final remark, a 4-byte BINARY and an INT are probably very close in performance. You might see a difference if the values are not created in an increasing, binary-sorted order. That can cause page splits during insert operations and therefore impact your write performance.
Is it really that bad to use "varchar" as the primary key?
(will be storing user documents, and yes it can exceed 2+ billion documents)
It totally depends on the data. There are plenty of perfectly legitimate cases where you might use a VARCHAR primary key, but if there's even the most remote chance that someone might want to update the column in question at some point in the future, don't use it as a key.
If you are going to be joining to other tables, a varchar, particularly a wide varchar, can be slower than an int.
Additionally, if you have many child records and the varchar is subject to change, cascading updates can cause blocking and delays for all users. A varchar like a car VIN that will rarely if ever change is fine. A varchar like a name that will change can be a nightmare waiting to happen. PKs should be stable if at all possible.
Next, many possible varchar PKs are not really unique, and sometimes they only appear to be unique (like phone numbers) but can be reused (you give up the number, the phone company reassigns it), and then child records could be attached to the wrong place. So be sure you really have a unique, unchanging value before using one.
If you do decide to use a surrogate key, then make a unique index for the varchar field. This gets you the benefits of the faster joins and fewer records to update if something changes, but maintains the uniqueness that you want.
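For example, a sketch (made-up names) using a VIN as the natural key:

-- Surrogate integer PK for joins; the VIN stays unique via its own index
CREATE TABLE dbo.Vehicles
(
    VehicleId INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Vehicles PRIMARY KEY CLUSTERED,
    Vin VARCHAR(17) NOT NULL
);

CREATE UNIQUE NONCLUSTERED INDEX UQ_Vehicles_Vin
    ON dbo.Vehicles (Vin);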
Now, if you have no child tables and probably never will, most of this is moot, and adding an integer PK is just a waste of time and space.
I realize I'm a bit late to the party here, but thought it would be helpful to elaborate a bit on previous answers.
It is not always bad to use a VARCHAR() as a primary key, but it almost always is. So far, I have not encountered a time when I couldn't come up with a better fixed size primary key field.
VARCHAR requires more processing than an integer (INT) or a short fixed length char (CHAR) field does.
In addition to storing extra bytes which indicate the "actual" length of the data stored in this field for each record, the database engine must do extra work to calculate the position (in memory) of the starting and ending bytes of the field before each read.
Foreign keys must also use the same data type as the primary key of the referenced parent table, so processing further compounds when joining tables for output.
With a small amount of data, this additional processing is not likely to be noticeable, but as a database grows you will begin to see degradation.
You said you are using a GUID as your key, so you know ahead of time that the column has a fixed length. This is a good time to use a fixed-length CHAR(36) field, which incurs far less processing overhead.
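A sketch of what that looks like (the table is hypothetical):

-- Fixed-width storage of a GUID's 36-character string form
CREATE TABLE dbo.Documents
(
    DocumentKey CHAR(36) NOT NULL
        CONSTRAINT PK_Documents PRIMARY KEY,
    Title VARCHAR(200) NOT NULL
);

INSERT INTO dbo.Documents (DocumentKey, Title)
VALUES (CONVERT(CHAR(36), NEWID()), 'example');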
I think int or bigint is often better.
int can be compared with fewer CPU instructions (join queries, ...)
an int sequence is ordered by default -> balanced index tree -> no reorganization if you use the PK as the clustered index
the index potentially needs less space
Use an ID (this will come in handy if you want to show only 50 rows at a time, etc.). Then set a UNIQUE constraint on your varchar with the file names (I assume that is what you are storing).
This will do the trick and will increase speed.
I am working on database standards for a new database my company is starting. One of the things we are trying to define is Primary Key and Clustered Index rules in relation to UniqueIdentifiers.
(NOTE: I do not want a discussion on the pros and cons of using a UniqueIdentifier as a primary key or clustered index. There is a ton of info on the web about that. This is not that discussion.)
So here is the scenario that has me worried:
Say I have a table with a UniqueIdentifier as the clustered index and primary key. Let's call it ColA. I set the default value for ColA to be NewSequentialId().
Using that NewSequentialId() I insert three sequential rows:
{72586AA4-D2C3-440D-A9FE-CC7988DDF065}
{72586AA4-D2C3-440D-A9FE-CC7988DDF066}
{72586AA4-D2C3-440D-A9FE-CC7988DDF067}
Then I reboot my server. The docs for NewSequentialId say that "After restarting Windows, the GUID can start again from a lower range, but is still globally unique."
So the next starting point can be lower than the previous range.
So after the restart, I insert 3 more values:
{35729A0C-F016-4645-ABA9-B098D2003E64}
{35729A0C-F016-4645-ABA9-B098D2003E65}
{35729A0C-F016-4645-ABA9-B098D2003E66}
(I am not sure exactly how the GUID is represented in the database, but let's assume, since this one starts with 3 and the previous ones started with 7, that the 3 ones are "smaller" than the 7 ones.)
When you do an insert that is in the middle of a clustered index, a remapping of the index has to happen. (At least so my DBA has told me.) And every time I reboot I run the risk of having my new UniqueIdentifier range be right in the middle of other previous ranges.
So my question is: Since the next set of UniqueIdentifiers will be smaller than the last set, will every insert cause my clustered index to shuffle?
And if not, why? Does SQL Server know that I am using NewSequentialId? Does it some how compensate for that?
If not, then how does it know what I will insert next? Maybe the next million inserts will start with 3. Or maybe they will start with 7. How does it know?
Or does it not know, and just keeps everything in order? If that is the case, then one reboot could massively affect performance. (Which makes me think I need my own custom NewSequentialId that is not affected by reboots.) Is that correct? Or is there some magic I am not aware of?
EDIT: GUID as a clustered index is strongly discouraged in my standard. As I said above, there are many reasons that this is a bad idea. I am trying to find out if this is another reason why.
Normally you will create your indexes with an appropriate fill factor to leave empty space in all your pages for just such a scenario. That being said, pages in the clustered index will start splitting again once that empty space is filled.
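For instance, a sketch using the question's ColA on a hypothetical table, leaving 20% of each page free at build time:

-- Sketch: leave 20% free space per page to absorb out-of-order inserts
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (ColA)
    WITH (FILLFACTOR = 80);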
I know you don't want to discuss using GUID as a clustered key, but this is one of the reasons that it's not a recommended practice.
What will happen is that you will have an increasing volume of page splits, which will lead to a very high level of fragmentation as you keep inserting rows, and you will need to rebuild your index at a higher frequency to keep performance in line.
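Rebuilding reapplies the fill factor and removes the accumulated fragmentation; a sketch (hypothetical names again):

-- Sketch: rebuild the fragmented index and restore its free space
ALTER INDEX PK_MyTable ON dbo.MyTable
    REBUILD WITH (FILLFACTOR = 80);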
For a full treatment of the topic, there's no better source than Kim Tripp's blog.
As a side note, when you are considering creating your own NewSequentialID creation function, you probably have a design issue and should reconsider your plan.