Using INT or GUID as primary key - sql-server

I was trying to create an ID column in SQL Server (used from a VB.NET application) that would assign a sequential number to every new row. I used the following technique to create the ID column.
SELECT * FROM T_Users

ALTER TABLE T_Users
ADD User_ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY
Then I registered a few usernames into the database and it worked just fine; the first six rows were 1, 2, 3, 4, 5, 6. But when I registered 4 more users the next day, the ID numbers jumped from 6 to a much larger value: 1, 2, 3, 4, 5, 6, 1002, 1003, 1004, 1005. Two days later I registered two more users and the new rows read 3002, 3004.
So my question is: why is it skipping such a large number every other day I register users? Is the technique I used to create the sequence wrong? If so, can anyone please tell me how to do it right?
As I was getting frustrated with the technique above, I alternatively tried sequentially generated GUID values. The sequence of GUIDs was generated fine; the only downside is that it produces very long values (4 times the INT size). My question here is: does using a GUID have any significant advantage over INT?

Upside of GUIDs:
GUIDs are good if you ever want offline clients to be able to create new records, as you will never get a primary key clash when the new records are synchronised back to the main database.
Downside of GUIDs:
GUIDs as primary keys can have an effect on the performance of the DB, because with a clustered primary key the DB keeps the rows in order of the key values. Since GUIDs are effectively random, this means a lot of inserts between existing records.
An IDENTITY column doesn't suffer from this because the next record is guaranteed to have the highest value, so the row is just tacked onto the end every time; no reshuffle needs to happen.
There is a compromise, which is to generate a pseudo-GUID; with it you would expect a key clash only every 70 years or so, but it helps the indexing immensely.
The other downsides are that a) they take up more storage space, and b) they are a real pain to write SQL against: it is much easier to type UPDATE TABLE SET FIELD = 'value' WHERE KEY = 50003 than UPDATE TABLE SET FIELD = 'value' WHERE KEY = '{F820094C-A2A2-49cb-BDA7-549543BB4B2C}'.
Your declaration of the IDENTITY column looks fine to me. The gaps in your key values are probably due to failed attempts to add a row. The IDENTITY value will be incremented but the row never gets committed. Don't let it bother you, it happens in practically every table.
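You can see this for yourself with a rolled-back insert. A minimal sketch; it assumes T_Users has a UserName column:

BEGIN TRANSACTION;
INSERT INTO T_Users (UserName) VALUES ('alice');  -- consumes an identity value
ROLLBACK TRANSACTION;                             -- the row is gone, but the value is not reused

INSERT INTO T_Users (UserName) VALUES ('bob');
SELECT User_ID FROM T_Users WHERE UserName = 'bob';  -- skips the rolled-back value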
EDIT:
This question covers what I meant by pseudo-GUID: INSERTs with sequential GUID key on clustered index not significantly faster
In SQL Server 2005+ you can use NEWSEQUENTIALID() to get a GUID that is guaranteed to be greater than any previously generated on that computer (until the machine restarts). See here for more info: http://technet.microsoft.com/en-us/library/ms189786%28v=sql.90%29.aspx

Is the technique I used to create the sequence wrong?
No. If anything, your Google skills are non-existent. A quick search for "SQL Server identity skipping values" will give you a TON of results, including:
SQL Server 2012 column identity increment jumping from 6 to 1000+ on 7th entry
and the canonical:
Why are there gaps in my IDENTITY column values?
You basically assume, wrongly, that SQL Server will not optimize its access for performance. Identity numbers are markers, nothing else; do not assume they are gap-free.
In particular, SQL Server preallocates identity values in blocks of 1,000, and if you restart the server (as you likely do on a workstation), the remainder of the current block is lost.
http://www.sqlserver-training.com/sequence-breaks-gap-in-numbers-after-restart-sql-server-gap-between-numbers-after-restarting-server/
If you use a manual SEQUENCE instead (new in SQL Server 2012), you can define the cache size for this pre-generation and set it to 1, at the cost of slightly lower performance when you do a lot of inserts; see the sketch below.
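A minimal sketch of that option. Object names are illustrative, and note that a column cannot be both IDENTITY and sequence-defaulted, so the sequence replaces the IDENTITY property:

CREATE SEQUENCE dbo.UserIdSeq AS INT
    START WITH 1
    INCREMENT BY 1
    NO CACHE;  -- no pre-generated block to lose on restart

CREATE TABLE dbo.T_Users2
(
    User_ID  INT NOT NULL
        CONSTRAINT DF_TUsers2_UserID DEFAULT (NEXT VALUE FOR dbo.UserIdSeq)
        CONSTRAINT PK_TUsers2 PRIMARY KEY,
    UserName VARCHAR(50) NOT NULL
);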
My question here is does using GUID have any significant advantage over INT?
Yes. You can have a lot more rows with GUIDs than with an int. For example, int32 is limited to about 2 billion positive values. For some of us that is too low (I have tables in the 10-billion-row range), and even a 64-bit bigint is limited. For a truly zettabyte-scale database, you have to use a self-generated, sequential GUID.
Most of us never see that difference, as we do not really deal with that many rows, and the larger size makes a lot of things slower (larger key size = larger entries in indices = larger indices = more memory/IO for the same operation). Plus, even your sequential ID will jump.
Why not just adjust your expectations to reality - identity is not meant to be gap-free - or use a sequence with a cache of 1?

Related

SQL Server : primary key auto increment - what about deleted rows and free key values?

I'm kind of new to SQL and databases and there's one thing that bothers me.
I'm using SQL Server for my ASP.NET MVC project and my database and its tables were auto-generated by Entity Framework using a code-first approach.
I have a table for book collections - just CollectionId and Name columns.
During development I've made many inserts and deletes in this table, and right now it has 10 rows with Ids 1 to 10 (the initial entries). But when I add a new one, its Id is set to 37. Obviously in the past there were entries with Ids up to 36, but they are gone now and those numbers seem to be free.
So why does a new entry not get Id 11? Is this a kind of limitation, or maybe a security feature?
Thank you for your answers.
This is the default behavior of an identity column. Whenever we perform delete operations, there will be gaps in the identity values.
Remarks from MSDN:
If an identity column exists for a table with frequent deletions, gaps can occur between identity values. If this is a concern, do not use the IDENTITY property. However, to ensure that no gaps have been created or to fill an existing gap, evaluate the existing identity values before explicitly entering one with SET IDENTITY_INSERT ON.
IDENTITY
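A sketch of the explicit-fill technique those remarks mention, using the CollectionId and Name columns from the question; the table name dbo.Collections and the value 11 are assumptions:

SET IDENTITY_INSERT dbo.Collections ON;

INSERT INTO dbo.Collections (CollectionId, Name)
VALUES (11, 'Fills the gap');  -- an explicit value below the current identity

SET IDENTITY_INSERT dbo.Collections OFF;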
In addition to the other answer, it also has to do with the performance of the server. The server typically caches a group of IDs in memory to make assignment much faster, since the next available number otherwise has to be read from and written to disk. So if the server allocates 100 numbers at a time, it only has to write to disk once for every 100 usages (inserts) of the identity.
Maintaining a gapless sequence would suck up a lot of time.
If you create a new table, insert a single row, kill the server, and restart, you'll find the next insert will most likely leave a gap the size of that cached block.
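If you want to see where the counter currently stands, or reset it, DBCC CHECKIDENT does both; the table name here is assumed:

DBCC CHECKIDENT ('dbo.Collections', NORESEED);   -- report the current identity value, change nothing
DBCC CHECKIDENT ('dbo.Collections', RESEED, 10); -- reset so the next insert gets 11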

Changing to newsequentialid() on an existing table with 4 million records

My question is almost the same as Changing newid() to newsequentialid() on an existing table.
If I change to newsequentialid(), the index should be more compact, correct?
And what happens if the sequence hits existing IDs while inserting new records? Will the database check for that beforehand?
It will have less fragmentation, so you could say it's more compact, but it will still be the same size per key (16 bytes + overhead) for a GUID key.
The benefit of a sequential GUID over a non-sequential one is that you have fewer chances for a page split. A page split happens when a record has to be inserted into a logical page that cannot hold it, so the page is "split": half the rows go to one page and half to another. Sometimes a page split causes another page to split, and theoretically you can trigger a cascading, costly chain of splits by inserting just one new record. With a sequential key, it's less likely that you'll randomly trigger a page split somewhere in the middle of your index, so you reduce the likelihood of those happening.
Using a sequential GUID also helps optimize range scans (e.g. selecting between one value and another), but with a GUID it's very unlikely that you'll end up doing many range-based scans, since the value is basically meaningless.
What happens when the sequence hits an existing ID? You get a PK violation; SQL Server doesn't ensure that a GUID can only be used once. Sequential GUIDs start at a new seed every time the server is restarted, so in theory you could skip back in the sequence and wind up generating the same value twice. However, as with GUIDs in general, the likelihood of this happening is so astronomically small as to be statistically insignificant.
As with everything, the costs and benefits depend on your specific scenario. If you're looking to replace a GUID key with a sequential key, see if it's possible to use an int or bigint surrogate key instead of a GUID, because generally, all things being equal, an integer will outperform a GUID in every case. 4 million records will trivially fit into an INT data type, and even more trivially into a BIGINT.
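If you do stay with the GUID, the change itself is just a default-constraint swap, followed by an index rebuild to clear the accumulated fragmentation. A sketch with illustrative object names:

ALTER TABLE dbo.MyTable DROP CONSTRAINT DF_MyTable_Id;  -- the old NEWID() default

ALTER TABLE dbo.MyTable ADD CONSTRAINT DF_MyTable_Id
    DEFAULT (NEWSEQUENTIALID()) FOR Id;

ALTER INDEX PK_MyTable ON dbo.MyTable REBUILD;  -- existing rows keep their random values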
Hope this helps.

Does insert of GUID PK take longer and cause paging?

On choosing GUID vs identity INT for primary key…
Is it true that an insert of a GUID is going to take longer because it has to find the right place in the index to put that GUID? And that it can cause page splits?
If this is true, maybe we should have used identity INT all along?
Thanks.
It is true that a newly generated GUID value can be less than a previously inserted one, and yes, enough of those can force page splits in the index. An INT or BIGINT auto-increment/identity column wouldn't run into that unless you manually inserted lower values with IDENTITY_INSERT ON.
If you can't change from a GUID for some reason, check out NEWSEQUENTIALID():
http://msdn.microsoft.com/en-us/library/ms189786.aspx
It will generate a GUID greater than any previously generated one. The caveat is that the "greater than" guarantee only holds until the machine restarts.
I think you mean CLUSTERED INDEX more so than PRIMARY KEY. Of course, ANY index will get fragmented rather quickly when using NEWID(), as the values are not ever-increasing and hence cause page splits.
I provided a rather detailed explanation of the effects of non-sequential INSERT operations here:
What is fragmenting my index on a table with stable traffic?
Yes, NEWSEQUENTIALID() is far better, at least in terms of GUIDs, in that it is sequential, BUT it is not a guarantee against page splits. The issue is that the starting value is determined from a value that gets reset when the server is restarted. Hence, after a restart, the new sequential range can start lower than the previous first record. The first INSERT at that point will cause a page split, but from there on it should be fine.
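You can measure the fragmentation being described here with sys.dm_db_index_physical_stats; the table name is illustrative:

SELECT index_id,
       avg_fragmentation_in_percent,
       page_count
FROM   sys.dm_db_index_physical_stats(
           DB_ID(), OBJECT_ID(N'dbo.MyTable'), NULL, NULL, 'LIMITED');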
Also, a UNIQUEIDENTIFIER is 16 bytes, compared to 4 for INT and even 8 for BIGINT. That increased size makes each row take up more space, and hence fewer rows fit on each 8 KB data page. Allocating new pages takes time, especially if you need to grow the data file to fit the new pages (and the larger the rows, the faster the data file fills). Hence, yes, you most likely should have gone with INT IDENTITY from the start.
In cases where an external and/or portable identifier is needed (this is where GUIDs are very handy), you should still start with an INT (or even BIGINT) IDENTITY field as the clustered PK, and add another column for the GUID with a UNIQUE index placed on it. This is known as an alternate key and works well in these situations.
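A sketch of that pattern; the names are illustrative:

CREATE TABLE dbo.Customer
(
    CustomerID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,  -- narrow, ever-increasing clustered key
    PublicID   uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_PublicID DEFAULT (NEWSEQUENTIALID()),
    Name       varchar(100) NOT NULL,
    CONSTRAINT UQ_Customer_PublicID UNIQUE NONCLUSTERED (PublicID)  -- the portable alternate key
);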

8 bytes or 6 bytes of timestamp in a COMB GUID in SQL Server

Thanks to the wonderful article The Cost of GUIDs as Primary Keys, we have the COMB GUID. Based on current implementations, there are two approaches:
use the last 6 bytes for the timestamp: GUIDs as fast primary keys under multiple databases
use the last 8 bytes for the timestamp, via Windows ticks: GUID COMB strategy in EF4.1 (CodeFirst)
With a 6-byte timestamp in the GUID, more bytes remain random, which reduces the chance of GUID collisions; however, more GUIDs with the same timestamp will be created, and those are not sequential at all. On that basis, the 8-byte timestamp would seem preferable.
So it seems a hard choice. The first article above, GUIDs as fast primary keys under multiple databases, says:
Before we continue, a short footnote about this approach: using a 1-millisecond-resolution timestamp means that GUIDs generated very close together might have the same timestamp value, and so will not be sequential. This might be a common occurrence for some applications, and in fact I experimented with some alternate approaches, such as using a higher-resolution timer such as System.Diagnostics.Stopwatch, or combining the timestamp with a "counter" that would guarantee the sequence continued until the timestamp updated. However, during testing I found that this made no discernible difference at all, even when dozens or even hundreds of GUIDs were being generated within the same one-millisecond window. This is consistent with what Jimmy Nilsson encountered during his testing with COMBs as well
I just wonder if someone who knows database internals could shed some light on the above observation. Is it because the database server just stores the data in memory and only writes to disk when it reaches a certain threshold? If so, the reordering of rows inserted with non-sequential GUIDs sharing the same timestamp would mostly happen in memory, and thus carry a minimal performance penalty.
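For reference, this is the classic T-SQL COMB construction from the article cited above. SQL Server compares uniqueidentifiers starting from the last byte group, which is why the timestamp goes into the final 6 bytes:

SELECT CAST(
           CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6))
       AS uniqueidentifier) AS CombGuid;
-- The datetime timestamp has limited resolution, so GUIDs created within the
-- same tick share a timestamp and are not ordered relative to each other.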
Update:
Based on our testing, the COMB GUID did not reduce table fragmentation compared with a random GUID, contrary to what is claimed around the internet. It seems the only way right now is to have SQL Server itself generate the sequential GUID.
The article you referenced is from 2002 and is very old. Just use NEWSEQUENTIALID() (available in SQL Server 2005 and up). This guarantees that each new id you generate is greater than the previous one, solving the index fragmentation/page-split issue.
Another aspect I'd like to mention, though, which the writer of that article glossed over, is that using 16 bytes when you only need 4 is not a good idea. Let's say you have a table with 500,000 rows averaging 150 bytes, not including the clustered-key column, and the table has 3 nonclustered indexes (each of which repeats the clustered key in every row), whose rows average 50, 25, and 4 bytes, again not counting the clustered key.
The storage requirements at a perfect 100% fill factor are then (all numbers in megabytes, except the % row):
Item    Clust    50      25      4       Total
----    -----    -----   -----   -----   ------
GUID    79.1     31.5    19.6    9.5     139.7
int     73.4     25.7    13.8    3.8     116.7
%imp    7.2%     18.4%   29.6%   60.0%   16.5%
In the nonclustered index having just one 4-byte int column (a common scenario), switching the clustered key to an int makes the index 60% smaller! That translates directly into a 60% improvement for any scans of that index, and that's conservative, because with smaller rows, page splits occur less often and fragmentation stays lower for longer.
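To see where that 60% comes from: each row of that index stores the 4-byte column plus the clustered key, so 4 + 16 = 20 bytes per row with a GUID key versus 4 + 4 = 8 bytes with an int key, and 8/20 = 40% of the original size (9.5 MB vs 3.8 MB in the table above).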
Even in the clustered index itself, there's still a 7.2% improvement, which is not nothing at all.
What if you used GUIDs throughout your entire database, whose tables had a similar profile to this one (where switching to int yields a 16.5% reduction in size), and the database itself was 1.397 terabytes in size? Your whole database would be 230 GB larger (scale up the Total column: 139.7 vs 116.7). That translates into real money in the real world for high-availability storage. It moves your disk-purchase schedule earlier in time, which is harmful to your company's bottom line.
Do not use larger data types than necessary, ever. It's like adding weight to your car for no reason: you will pay for it (if not in speed, then in fuel economy).
UPDATE
Now that I know you are creating the GUID in your client-side code, I can see more clearly the nature of your problem. If you are able to defer creating the GUID until row insertion time, here's one way to accomplish that.
First, set a default for your CustomerID column:
ALTER TABLE dbo.Customer ADD CONSTRAINT DF_Customer_CustomerID
    DEFAULT (NEWSEQUENTIALID()) FOR CustomerID;
Now you don't have to specify what value to insert for CustomerID in any INSERT, and your query could look like this:
DECLARE @Name varchar(100) = 'Acme Spy Devices';

INSERT dbo.Customer (Name)
OUTPUT inserted.CustomerID -- a GUID
VALUES (@Name);
In this very simple example, you have inserted a new row to the Customer table, and returned a rowset to the client containing the just-created value, all in one query.
Note that you could not simply write VALUES (NEWSEQUENTIALID(), @Name) instead: NEWSEQUENTIALID() can only be used in a DEFAULT constraint, which is why the default is set up first.

optimizing sql server database

My database has one very large table with over 2 billion rows with 3 columns.
Id (uniqueidentity), Type (int, between 0 and 10; 0 = most used, 10 = least used), Data (binary data between 1 and 10 MB)
What are some ways I can optimize this database? (primarily select queries)
*Note: I might add a few more columns to this table later (eg: location, date...)
Assuming that the id column is the clustered index key, and assuming that by uniqueidentity you mean uniqueidentifier:
Do you need the uniqueidentifier type? Why?
What other alternatives have you considered?
Do you populate the data using sequential GUIDs or not?
GUIDs are a notoriously poor choice for clustered keys. See GUIDs as PRIMARY KEYs and/or the clustering key for a more detailed discussion:
But, a GUID that is not sequential - like one that has its values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) - can be a horribly bad choice, primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity, which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion, you can always go with a bigint (8-byte int) and get 2^63-1 rows.
Also read Disk space is cheap...That's not the point! as a follow up.
Other than this, you need to do your homework and post the required details for such a question: exact table and index definitions, and the prevalent data access patterns (by key, by range, filters, sort order, joins, etc.).
Have you done any work to identify problems so far? If not, start with Waits and Queues, a proven methodology to identify performance bottlenecks. Once you measure and find places that need improvement, we can advise how to improve.
Add an index (or several); decide which column(s) make the most appropriate clustered index.
Decide whether storing 10 MB of binary data in each (otherwise small) row is a good use of a database; see the sketch below.
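If the uniqueidentifier key does turn out to be replaceable, here is a sketch of the kind of schema the quoted advice points toward (names and the tinyint choice are assumptions, not from the question):

CREATE TABLE dbo.BigData
(
    Id   bigint IDENTITY(1,1) NOT NULL
         CONSTRAINT PK_BigData PRIMARY KEY CLUSTERED,  -- narrow, ever-increasing clustered key
    Type tinyint NOT NULL,            -- 0-10 fits in a single byte
    Data varbinary(max) NOT NULL      -- consider FILESTREAM for 1-10 MB blobs
);

CREATE NONCLUSTERED INDEX IX_BigData_Type ON dbo.BigData (Type);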
[Updated in response to Remus's comment]
