SQL Server unique GUID - sql-server

I understand that in SQL Server GUIDs are mostly unique, and that the likelihood of a collision is remote, yet someone must win the lottery, so I feel it makes some sense to prepare for the possibility.
Which is the faster/better practice:
Using a technique where I assign a new GUID directly by just inserting a row and checking for an error (@@ERROR <> 0), repeating until I don't get an error [which I suppose in theory would happen at worst once...]
or using an approach like this
DECLARE @MyGUID uniqueidentifier
SELECT @MyGUID = NEWID()
IF EXISTS (SELECT * FROM tablename WHERE UserID = @MyGUID)
and looping over that till I find one not in use.
I like the 2nd approach because I can then have the GUID for use later on in the Stored Procedure so I'm currently leaning towards that one.

If you have at least one network adapter in your computer, then your GUIDs will be unique. If you don't, then the possibility of colliding with a GUID generated on another machine exists in theory, but it is never going to happen to you. Writing code to guard against duplicate GUIDs is a total waste of time.
That being said, enforcing uniqueness of anything in a relational database is done by only one means: create a unique constraint on the data:
ALTER TABLE tablename ADD CONSTRAINT uniqueUserID UNIQUE (UserID);

The chances of a collision happening on your computer are much less likely than you winning the lottery.
Use NEWID() and don't worry about a hugely unlikely event.
If the column is declared unique, a duplicate will not get persisted anyway.

To actually answer the question, and not debate the merit of the question/perceived problem:
The first implementation will be the one you want to use, for two reasons:
Running an existence check before doing the insert for every single record you're dealing with will, in the end, dedicate more resources to something that is extremely unlikely to happen. (The database will also enforce the constraint on the column, so if a collision does occur the data won't be committed.)
If you get an error on the first attempt, you can take a little extra time and handle the error that is returned.
You can combine the two: declare the new uniqueidentifier and insert it into the table; if the insert works, continue using it, else retry with a new one and then continue using it (a sketch follows below).
Generally you want to program for the most likely situation first and then handle the exceptions last.
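As a rough sketch of that combined approach (the table and column names are taken from the question and otherwise hypothetical; in practice the retry branch should essentially never run):

DECLARE @MyGUID uniqueidentifier;
DECLARE @Done bit = 0;

WHILE @Done = 0
BEGIN
    SET @MyGUID = NEWID();
    BEGIN TRY
        INSERT INTO tablename (UserID) VALUES (@MyGUID);  -- the unique constraint on UserID is the real guarantee
        SET @Done = 1;  -- insert succeeded; keep @MyGUID for the rest of the stored procedure
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() NOT IN (2601, 2627)  -- anything other than a duplicate-key error: re-raise it
            THROW;
        -- otherwise loop around and try a new GUID
    END CATCH
END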

The probability of a collision is so low - far less than that of winning the lottery - that it's more likely that cosmic rays will strike the RAM in the machine and cause your program to fail in some other, arbitrary manner. It's more likely that your error handling for this case will (now or eventually) contain an error that will lead to failure. Researchers who deliberately try to locate collisions using extensive computing power haven't succeeded so far in finding even one. I don't think it's worth your effort or time to handle this error case, at least until you've handled the wide range of far more likely hardware and software errors that may occur. I realize this is counterintuitive, but trust me.

Generally, I only use a GUID primary key if I'm dealing with distributed databases where clients need to add new records while offline. If this isn't your scenario, you're better off using an autoincremented int for your primary key.

Related

Removing PAGELATCH with randomized ID instead of GUID

We have two tables which receive 1 million+ insertions per minute. These tables are heavily indexed, and the indexes can't be removed because they support business requirements. Due to such a high volume of insertions, we are seeing PAGELATCH_EX and PAGELATCH_SH waits, which further slow down insertions.
A commonly accepted solution is to change the identity column to a GUID so that insertions land on a random page every time. We could do this, but changing the IDs would require a development cycle for migration scripts so that existing production data can be changed.
I tried another approach which seems to be working well in our load tests. Instead of changing to a GUID, we now generate IDs in a randomized pattern using the following logic:
DECLARE @ModValue int;
SELECT @ModValue = (DATEPART(NANOSECOND, GETDATE()) % 14);
INSERT xxx(id)
SELECT NEXT VALUE FOR Sequence * (@ModValue + IIF(@ModValue IN (0,1,2,3,4,5,6), 100000000000, -100000000000))
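(For reference, Sequence above refers to a SQL Server sequence object, created along roughly these lines; the exact name, type, and cache size here are assumptions:)

CREATE SEQUENCE dbo.Sequence
    AS bigint
    START WITH 1
    INCREMENT BY 1
    CACHE 1000;  -- caching keeps sequence allocation cheap under a heavy insert load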
It has eliminated the PAGELATCH_EX and PAGELATCH_SH waits and our insertions are quite fast now. I also think a GUID as the PK of such a critical table is less efficient than a bigint ID column.
However, some team members are sceptical about this, as IDs with negative values, generated on a random basis, are not a common solution. There is also an argument that the support team may struggle with large negative IDs: a common habit of writing select * from table order by 1 will need to change.
I am wondering what the community's take on this solution is. If you could point out any disadvantages of the suggested approach, that would be highly appreciated.
However, some team members are sceptical about this, as IDs with negative values, generated on a random basis, are not a common solution.
You have an uncommon problem, and so uncommon solutions might need to be entertained.
Also, there is an argument that the support team may struggle with large negative IDs. A common habit of writing select * from table order by 1 will need to change.
Sure. The system as it exists now has a high (but not perfect) correlation between IDs and time. That is, in general a higher ID for a row means that it was created after one with a lower ID. So it's convenient to order by IDs as a stand-in for ordering by time. If that's something that they need to do (i.e. order the data by time), give them a way to do that in the new proposal. Conversely, play out the hypothetical scenario where you're explaining to your CTO why you didn't fix performance on this table for your end users. Would "so that our support personnel don't have to change the way they do things" be an acceptable answer? I know that it wouldn't be for me but maybe the needs of support outweigh the needs of end users in your system. Only you (and your CTO) can answer that question.

What advantages do constraints provide to a database?

I realize this question may seem a little on the "green" side, but after the number of "enterprise" or "commercial" databases I've encountered, I've begun to ask this question. What advantages do constraints provide to a database? I'm asking more about foreign key constraints than unique constraints. Do they offer performance gains, or just data integrity?
I've been rather surprised at the number of relational databases without foreign keys or even without specified primary keys (just constraints on fields being not null or a unique constraint on the field).
Thoughts?
"just" data integrity? You say that like it's a minor thing. In all applications, it's critical. So yes, it provides that, and it's a huge benefit.
Data integrity is what they offer. If anything, they have a performance cost (a very minor one at that).
They provide both performance and data integrity, and the latter is paramount to any serious system. I cringe every time I see a database without any foreign keys and where all integrity is done through triggers (if at all). And I saw quite a bit of those out there.
The following, assuming you get the constraint right in the first place:
Your data will be valid with respect to the constraint
The database knows your data will be valid with respect to the constraint and can use this when querying or updating the database (e.g. removing an unnecessary join for a query on a view)
The constraint is documented for future users of the database
A violation of the constraint will be caught as soon as possible; not in some later unrelated process that fails
In relational theory, a database that allows inconsistent data isn't really a relational database. Foreign keys are necessary for data integrity and consistency to keep the database "relational"; i.e. the logical model of the database is always correct.
In practical terms, it's usually easier to define a foreign key and let the DB engine handle making sure the relation is valid (a minimal sketch follows the list below). The other options are:
nothing - guaranteed data corruption at some point
DB triggers - which will usually be slower and less performant
application code - which will eventually cause problems when either you forget to call the right code or another application accesses the database.
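A minimal sketch of the foreign key option mentioned above (the table and column names are made up for illustration):

CREATE TABLE dbo.Departments
(
    DepartmentID int NOT NULL PRIMARY KEY,
    Name nvarchar(100) NOT NULL
);

CREATE TABLE dbo.Employees
(
    EmployeeID int NOT NULL PRIMARY KEY,
    DepartmentID int NOT NULL,
    -- the engine now rejects any employee row that points at a non-existent department
    CONSTRAINT FK_Employees_Departments
        FOREIGN KEY (DepartmentID) REFERENCES dbo.Departments (DepartmentID)
);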
Data is an asset. Lots of textbooks state that.
But it is actually wrong. It should rather say "correct data is an asset, incorrect data is a liability".
And database constraints give you the best possible guarantee that data is correct.
In some DBMSs (e.g. Oracle) constraints can actually improve the performance of some queries, since the optimiser can use the constraints to gain knowledge about the structure of the data. For some examples, see this Oracle Magazine article.
I would say all required constraints must be in the database. Foreign key constraints prevent unusable data. They aren't a nice to have - they are a requirement unless you want a useless database. Foreign keys may hurt performance of deletes and updates but that is OK. Is it better to take a little longer to do a delete (or to tell the application not to delete this person because he has orders in the system) or to delete the user but not his data? Lack of foreign keys may cause unexpected and often serious problems in querying the data. For instance the cost reports may depend on all the tables having related data and so may fail to show important data because one or more tables have nothing to join to.
Unique constraints are also a requirement of any decent database. If a field or group of fields must be unique, failing to define this at the database level is to create data problems that are extremely hard to fix.
You don't mention other constraints, but you should. Any business rule that must always apply to all data in the table should be applied in the database, through a datatype (such as a datetime datatype, which will not accept '02/31/2009' as a valid date), a constraint (say, one that does not allow the field to have a value greater than 100), or through a trigger if the logic is so complex it cannot be handled by an ordinary constraint. (Triggers are tricky to write if you don't know what you are doing, so if you have logic this complex, you hopefully have a database professional on your team.) The order is important too: datatypes are the first choice, followed by constraints, followed by triggers as a last choice.
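For instance, a rough sketch of the datatype and constraint choices (the table and column names are made up for illustration):

CREATE TABLE dbo.Orders
(
    OrderID int NOT NULL PRIMARY KEY,
    OrderDate datetime NOT NULL,  -- the datatype itself rejects '02/31/2009'
    DiscountPercent int NOT NULL
        CONSTRAINT CK_Orders_Discount CHECK (DiscountPercent <= 100)  -- business rule enforced as a check constraint
);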
Simpler Application Code
One nice thing they provide is that your application code has to do a lot less error checking and validation. Contrast these two bits of code and multiply by thousands of operations and you can see there's a big win.
get department number for employee # it's good coz of constraints
do something with department number
vs.
get department number for employee
if department number is empty
...
else if department number not in list of good department numbers
....
else
do something with department number
Of course, people that ignore constraints probably don't put a lot of effort into code validation anyway... :-/
Oh, and if the data constraints change, it's a database configuration issue and not a code change issue.
Integrity constraints are especially important when you integrate several applications using a shared database.
You may be able to properly manage data integrity in a single application's code (and even if you don't, at least the broken data affects only that application), but with multiple apps it gets hairy (and at the least redundant).
"Oh, and if the data constraints change, it's a database configuration issue and not a code change issue."
Unless it's a constraint that disappears from the design. In that case, there definitely IS a code impact, because some code might have been written that depends on that removed constraint being there.
That must always be taken into consideration when relaxing or removing any declared constraint.
Other than that, you are of course completely right.

Are there other strategies available for auto-incrementing a primary key besides the default x+1?

Are there other strategies available for auto-incrementing a primary key? So instead of the usual x+n where n is 1, could n be some other number such as 5? Or perhaps a random number but such that the next ID is still unique?
I'm not sure what the uses would be, or even if it's a good idea, I'm just curious.
In T-SQL (MS SQL Server) you use the IDENTITY property:
"IDENTITY [(seed, increment )]
seed Is the value that is used for the very first row loaded into the table.
increment Is the incremental value that is added to the identity value of the previous row that was loaded."
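For example, a quick sketch (the table name is made up) of an identity that starts at 1000 and increments by 5:

CREATE TABLE dbo.Widgets
(
    WidgetID int IDENTITY(1000, 5) NOT NULL PRIMARY KEY,  -- generates 1000, 1005, 1010, ...
    Name nvarchar(50) NOT NULL
);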
One thing you really shouldn't do is try and roll your own unless you really know what you are doing and can handle concurrency etc.
Edit: Why you need to be careful:
Here's some pseudo SQL that shows the danger of using your own increment etc:
1. @NextID = SELECT MAX(ID) FROM MyTable;
2. INSERT INTO MyTable (ID, Name) VALUES (@NextID + 1, 'Dan');
Now, in most cases this would work OK. However, in a high volume database there is always a chance that another insert statement might occur between steps 1 and 2. You'd then end up with a primary key violation and an error. It may not happen often, but when it does it would be a pain to pinpoint.
The way around this is to take a LOCK before step 1 and release it after step 2. This effectively queues all transactions and the database runs in "single user" mode. In high volume systems this can be very detrimental to performance.
This is why all major databases have sequences or built-in methods for auto-incrementing primary keys.
In the case of Master-Master replication in MySQL, it is important to have the different servers have completely separate auto increments, to avoid collisions.
The keys are kept separate by specifying an offset and increment for each server. For example, on one of the servers:
auto_increment_increment = 2
auto_increment_offset = 2
We also use different increment schemes for master/master replication situations. I think we have one server counting up and another counting down. Even/odd is perfectly valid, too.
Depends on the RDBMS you are using. In Oracle it would be this (provided you set up a trigger for the primary key):
SQL> create sequence fun_seq
start with 8
increment by 2;
For random ids (and ones you can make unique across servers), you can use GUID fields instead of int fields. These have problems and issues of their own and you should read about them in depth before committing to such a design however.

GUID vs INT IDENTITY [duplicate]

Possible Duplicate:
How do you like your primary keys?
I'm aware of the benefits of using a GUID, as well as the benefits of using an INT as a PK in a database. Considering that a GUID is in essence a 128-bit INT and a normal INT is 32-bit, the INT is a space saver (though this point is generally moot in most modern systems).
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
Kimberly Tripp (SQLSkills.com) has an article on using GUIDs as primary keys. She advises against it because of the unnecessary overhead.
To answer your question:
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
I would use a GUID if my system had an online/offline version, where in the offline version you can save data that is transferred back to the server one day during a sync. That way, you are sure that you won't have the same key twice inside your database.
We have Guids in our very complex enterprise software everywhere. Works smoothly.
I believe Guids are semantically more suitable to serve as identifiers. There is also no point in unnecessarily worrying about performance until you are faced with that problem. Beware premature optimization.
There is also an advantage for database migration of any sort. With GUIDs you will have no collisions. If you attempt to merge several DBs where ints are used for identity, you will have to replace their values, and if those old values were used in URLs, they will now be different, resulting in an SEO hit.
Apart from being a poor choice when you need to synchronize several database instances, INT's have one drawback I haven't seen mentioned: inserts always occur at one end of the index tree. This increases lock contention when you have a table with a lot of movement (since the same index pages have to be modified by concurrent inserts, whereas GUID's will be inserted all over the index). The index may also have to be rebalanced more often if a B* tree or similar data structure is used.
Of course, int's are easier on the eye when doing manual queries and report construction, and space consumption may add up through FK usages.
I'd be interested to see any measurements of how well e.g. SQL Server actually handles insert-heavy tables with IDENTITY PK's.
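For concreteness, a rough sketch of the two designs being compared (table names are made up; IDENTITY appends all new rows at the end of the clustered index, while NEWID() spreads them across it):

-- Sequential inserts: every new row lands on the last page of the clustered index
CREATE TABLE dbo.EventsInt
(
    EventID int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
    Payload nvarchar(200) NOT NULL
);

-- Random inserts: new rows are scattered across the clustered index
CREATE TABLE dbo.EventsGuid
(
    EventID uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
    Payload nvarchar(200) NOT NULL
);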
the INT is a space saver (though this point is generally moot in most modern systems).
Not so. It may seem so at first glance, but note that the primary key of each table will be repeated multiple times throughout the database in indexes and as foreign key in other tables. And it will be involved in nearly any query containing its table - and very intensively when it's a foreign key used for a join.
Furthermore, remember that modern CPUs are very, very fast, but RAM speeds have not kept up. Cache behaviour becomes therefore increasingly important. And the best way to get good cache behaviour is to have smaller data sets. So the seemingly irrelevant difference between 4 and 16 bytes may well result in a noticeable difference in speed. Not necessarily always - but it's something to consider.
When comparing values, such as in a primary key to foreign key relationship, the INT will be faster. If the tables are indexed properly and the tables are small, you might not see much of a slowdown, but you'd have to try it to be sure. INTs are also easier to read and to communicate to other people. It's a lot simpler to say, "Can you look at record 1234?" instead of "Can you look at record 031E9502-E283-4F87-9049-CE0E5C76B658?"
If you are planning on merging databases at some stage, i.e. for a multi-site replication type setup, GUIDs will save a lot of pain. But other than that, I find INTs easier.
If the data lives in a single database (as most data for the applications that we write in general does), then I use an IDENTITY. It's easy, intended to be used that way, doesn't fragment the clustered index and is more than enough. You'll run out of room at 2 billion some records (~ 4 billion if you use negative values), but you'd be toast anyway if you had that many records in one table, and then you have a data warehousing problem.
If the data lives in multiple, independent databases or interfaces with a third-party service, then I'll use the GUID that was likely already generated. A good example would be a UserProfiles table in the database that maps users in Active Directory to their user profiles in the application via their objectGUID that Active Directory assigned to them.
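A rough sketch of that kind of mapping table (the names here are illustrative; the point is that the GUID comes from Active Directory rather than being generated by the database):

CREATE TABLE dbo.UserProfiles
(
    UserProfileID int IDENTITY(1, 1) NOT NULL PRIMARY KEY,  -- internal surrogate key
    ObjectGuid uniqueidentifier NOT NULL UNIQUE,             -- objectGUID assigned by Active Directory
    DisplayName nvarchar(100) NOT NULL
);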
Some OSes no longer generate GUIDs based on unique hardware features (CPU ID, MAC address) because it made tracing users too easy (privacy concerns). This means GUID uniqueness is often no longer as universal as many people think.
If you use some auto-id function of your database, the database could in theory make absolutely sure that there is no duplication.
I always think PKs should be numeric where possible. Don't forget that having GUIDs as a PK probably means they are also used in other tables as foreign keys, so paging and index sizes will be greater.
An INT is certainly much easier to read when debugging, and much smaller.
I would, however, use a GUID or similar as a license key for a product. You know it's going to be unique, and you know that it's not going to be sequential.
I think the database also matters. From a MySQL perspective - generally, the smaller the datatype the faster the performance.
It seems to hold true for int vs GUID too -
http://kccoder.com/mysql/uuid-vs-int-insert-performance/
I would use a GUID as a PK only if the key is bound to a similar value. For example, a user ID (users in WinNT are described with GUIDs), or a user group ID.
Another example: if you develop a distributed system for document management, and different parts of the system all over the world can create documents, I would use a GUID, because it guarantees that two documents created in different parts of the distributed system won't have the same ID.

Strategy for coping with database identity/autonumber maxing out

Autonumber fields (e.g. "identity" in SQL Server) are a common method for providing a unique key for a database table. However, given that they are quite common, at some point in the future we'll be dealing with the problem where they will start reaching their maximum value.
Does anyone know of or have a recommended strategy to avoid this scenario? I expect that many answers will suggest switching to GUIDs, but given that this would take a large amount of development (especially where many systems are integrated and share the value), is there another way? Are we heading in a direction where newer hardware/operating systems/databases will simply allow larger and larger values for integers?
If you really expect your IDs to run out, use bigint. For most practical purposes, it won't ever run out, and if it does, you should probably use a uniqueidentifier.
If you have 3 billion transactions per second (that's one transaction per cycle on a 3.0 GHz processor), it'll take about a century for bigint to run out (even after giving up a bit for the sign).
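For reference, the rough arithmetic: a signed bigint tops out at 2^63 ≈ 9.2 × 10^18 values, and 9.2 × 10^18 / (3 × 10^9 per second) ≈ 3.1 × 10^9 seconds, which is roughly 97 years.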
That said, once, "640K ought to be enough for anybody." :)
See these related questions:
Reset primary key (int as identity)
Is bigint large enough for an event log table?
The accepted/top voted answers pretty much cover it.
Is there a possibility of cycling through the numbers that have been deleted from the database? Or are most of the records still alive? Just a thought.
My other idea was Mehrdad's suggestion of switching to bigint.
Identity columns are typically set to start at 1 and increment by +1. Negative values are just as valid as positive, thus doubling the pool of available identifiers.
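For example, a quick sketch (the table name is made up) that seeds the identity at the bottom of the int range instead of at 1:

CREATE TABLE dbo.Requests
(
    RequestID int IDENTITY(-2147483648, 1) NOT NULL PRIMARY KEY,  -- walks up through the full ~4.29 billion int range
    Description nvarchar(200) NOT NULL
);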
If you may be getting such large data amounts that your IDs max out, you may also want to have support for replication so that you can have multiple somehow synchronized instances of your database.
For such cases, and for cases where you want to avoid having "guessable" IDs (web applications etc.), I'd suggest using Guids (Uniqueidentifiers) with a default of a new Guid as replacement for identity columns.
Because Guids are unique, they allow data to be synchronized properly, even if records were added concurrently to the system.
