How to replace all non-sequential GUIDs with sequential ones? - sql-server

I have a brown-field SQL Server 2005 database that uses standard, unsorted GUIDs for the majority of its primary key values and also in its clustered indexes (which is bad for performance).
How should I go about changing these to sequential GUIDs? One of the challenges would be replacing all of the foreign key values as I change each primary key.
Do you know of any tools or scripts to perform this type of conversion?

Remember that you can only use the newsequentialid() function as a column default.
So create a new table with 2 columns. Insert the keys from your original table into it (leave the other column alone; its default will fill it in).
Then join back to the original table and update the PK with the new sequential id. If the foreign keys are defined with ON UPDATE CASCADE, they should update by themselves.
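A minimal sketch of that approach, assuming a hypothetical Orders table with a uniqueidentifier primary key OrderID and foreign keys defined with ON UPDATE CASCADE:

-- Mapping table: the old key plus a new sequential guid filled in by the default
CREATE TABLE dbo.GuidMap
(
    OldID uniqueidentifier NOT NULL PRIMARY KEY,
    NewID uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID()
);

-- Capture every existing key; NewID is generated automatically by the default
INSERT INTO dbo.GuidMap (OldID)
SELECT OrderID FROM dbo.Orders;

-- Join back and swap in the sequential value; ON UPDATE CASCADE propagates it to child tables
UPDATE o
SET    o.OrderID = m.NewID
FROM   dbo.Orders AS o
JOIN   dbo.GuidMap AS m ON m.OldID = o.OrderID;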

How do you know it's bad for performance?
The advantage of GUIDs is that you don't have to worry about multiple processes creating records at the same time. It can simplify your code considerably, depending on the program.

SequentialGuids are best for performance when the Guid is the PK. This is the reason behind their existence.

Related

Do you need a primary key if duplicates are allowed?

This might be too subjective, but it's been puzzling me some time.
If you have a Fact table that allows duplicates with 10 dimensions that do not, do you really need a primary key?
Why Are There Duplicates?
It's a bit tricky, but ideally each duplicate is actually valid. The source system recording the records simply does not provide a unique identifier to distinguish them. We don't own that system, so there is no way to ever change it.
Data
The data arrives in batches and only includes the previous day's worth of records. Therefore, in the event of a republish, we just drop the entire day's worth of records and republish the new day of records without the use of a primary key.
This is how I would fix bad data.
Generate A Primary Key Already
I could, but if it's never used and there is no way to validate whether a duplicate is legitimate, why do it?
SQL Server database tables do not require a primary key.
A database engine may well create a primary key in the background though.
Yes, SQL Server tables can exist without a primary key. What matters more is the CLUSTERED index: if you have other NONCLUSTERED indexes on the table, every one of them uses the clustered key to locate the data. A primary key is a good candidate for the clustered key, and if it is short and you have other indexes, that is a good reason to create one.
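As an illustration, the fact table described above could live without a primary key entirely, with a clustered index on the load date so that the daily republish is a simple range delete (table and column names below are hypothetical):

-- No PRIMARY KEY: duplicates from the source system are allowed
CREATE TABLE dbo.FactSales
(
    LoadDate date  NOT NULL,
    StoreKey int   NOT NULL,
    Amount   money NOT NULL
);

-- Cluster on the load date so a whole day can be dropped cheaply
CREATE CLUSTERED INDEX IX_FactSales_LoadDate ON dbo.FactSales (LoadDate);

-- Republish: drop the entire day and reload it
DELETE FROM dbo.FactSales WHERE LoadDate = '20230501';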

SQL Server Performance Suggestion

I have been creating database tables using only a primary key of the datatype int and I have always had great performance but need to setup merge replication with updatable subscribers.
The tables use a typical primary key, data type int, with an identity increment. Setting up merge replication, I have to add a rowguid column to all tables with the newsequentialid() function for the default value. I noticed that the rowguid column gets indexed and was wondering if I need the primary key anymore.
Is it okay to have 2 indexes, the primary key int and the rowguid? What is the best layout for a merge replication table? Do I keep the int id for easy row referencing and just remove the index but keep the primary key? Not sure what route to take, Thanks.
Remember that if you remove the int id column and replace it with a GUID, you may need to rework a good deal of your data and your queries. And do you really want to do queries like:
select * from orders where customer_id = '2053995D-4EFE-41C0-8A04-00009890024A'
Remember if your ids are exposed to any users (often in the case of a customer because the customer table often has no natural key since names are not unique), they will find the guid daunting for doing research.
There is nothing wrong in an existing system with having both. In a new system, you could plan to not use the ints, but there is a great risk of introducing bugs if you try to remove them in a system already using them.
The only downside of replacing the integer primary key with the guid (that I know of) is that GUIDs are larger, so the btree (index space used) will be larger and if you have foreign keys to this table (which you'd also need to change) a lot more space may end up being used across (potentially) many tables.
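One common layout for such a table, sketched here with hypothetical names, keeps the narrow int identity as the clustered primary key and adds the rowguid column that merge replication requires alongside it:

CREATE TABLE dbo.Customers
(
    CustomerID int IDENTITY(1,1) NOT NULL,
    Name       nvarchar(100)     NOT NULL,
    rowguid    uniqueidentifier  ROWGUIDCOL NOT NULL DEFAULT NEWSEQUENTIALID(),
    CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerID), -- narrow clustered key for joins and FKs
    CONSTRAINT UQ_Customers_rowguid UNIQUE (rowguid)            -- merge replication needs the rowguid to be unique
);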

Primary Key: Slow Inserts?

Defining a column to be a primary key in a table in SQL Server - will this make inserts slower?
I ask because I understand this is the case for indexes.
The table has millions of records.
No, not necessarily! Sounds counter-intuitive, but read this quote from Kim Tripp's blog post:
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
So actually, having a good clustered index (e.g. on an INT IDENTITY column, if at all possible) does speed things up - even inserts, updates and deletes!
Primary keys are automatically indexed, clustered if possible and failing that non-clustered.
So in that sense inserts are slightly affected, but of course having no primary key would usually be much much worse, assuming the table needs a primary key.
First measure, identify a problem and then try to optimize. Optimizing away primary keys is a very bad idea in general.
Not enough to create a perceptible performance hit, and the benefits far outweigh the very minor performance cost. There are very, very few scenarios where you should not put a primary key on a table.
The really quick answer is:
Yes.
Primary keys are always indexed (and SQL will attempt a clustered index). Indexes make inserts slower, clustered indexes even more so.
Depending on what your table is used for, you may have a couple of options.
If you do a lot of bulk inserts followed by reads, you can remove the primary key, insert into a heap (in SQL 2008 this can be minimally logged to run even faster), then add the key back and wait for the index build to complete.
As an addendum to that, you can also insert using an ORDER BY clause which will keep the inserted rows in correct order for the clustered index. This will really only help if you are inserting millions of rows at once from a source that is already ordered, though.
Yes, adding a primary key to a table will slow inserts (which is OK, because not adding a primary key to your table will speed up your application's eventual catastrophic failure).
If what you're doing is creating a new table and then inserting millions of records into it, there is nothing wrong with initially creating the table without a primary key, inserting all the records, and then creating the primary key. Or use an alternative tool to perform a bulk insert.
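A rough outline of that pattern, with hypothetical table and column names:

-- Heap: no primary key and no clustered index during the load
CREATE TABLE dbo.BigLoad
(
    ID    int          NOT NULL,
    Value varchar(200) NOT NULL
);

-- Bulk insert from the source
INSERT INTO dbo.BigLoad (ID, Value)
SELECT ID, Value
FROM   dbo.SourceData;

-- Add the primary key (and its clustered index) once, after the load has finished
ALTER TABLE dbo.BigLoad
    ADD CONSTRAINT PK_BigLoad PRIMARY KEY CLUSTERED (ID);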
Yes, inserts are slowed, especially with several clients doing inserts simultaneously, and more so if your key is sequentially increasing (all inserts occur at the right-most nodes of the index tree, in most database implementations, or at the last page of the table for e.g. clustered SQL Server indices -- both of which scenarios cause resource contention).
That said, SELECTs using the primary key are sped up quite a bit, and the integrity of your key is guaranteed. Do the right thing first (define primary keys everywhere). Second, measure to see whether you are failing to meet your performance targets and whether that is caused by your data integrity constraints. Only then consider workarounds.
No, not necessarily
Regardless, that is not why you would define a primary key on a table.
You define a primary key when it is REQUIRED by the domain model.

Switching to sequential (comb) guids - what about existing data?

We have a database with 500+ tables, in which almost all the tables have a clustered PK that is of datatype guid (uniqueidentifier).
We are in the process of testing a switch from "normal" "random" guids generated through .NET's Guid.NewGuid() method to sequential guids generated through the NHibernate guid.comb algorithm. This seems to be working well, but what about clients that already have millions of rows with "random" primary key values?
Will they benefit from the fact that new ids generated from now on will be sequential?
Could/should anything be done to their existing data?
Thanks in advance for any pointers on this.
You could do this, but I'm not sure you would want to. I don't see any benefit in using sequential guids; in fact, using guids as a primary key is not recommended at all unless there are distributed/replication reasons involved. Are you using a clustered index?
Having said that, if you go ahead, I recommend loading a table with values from your algorithm first.
You are going to have hassles with foreign keys. You will need to associate the old and new guids in the aforementioned table, drop the foreign keys, perform a transactional update, then reapply the foreign keys.
I don't think it is worth the hassle unless you were moving away from guids altogether to, say, an integer-based system.
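If you do go ahead, here is a sketch of that drop/update/reapply sequence for a single hypothetical child table (OrderLines referencing Orders), using a mapping table of old and new guids called GuidMap:

-- Drop the foreign key so the parent keys can be rewritten
ALTER TABLE dbo.OrderLines DROP CONSTRAINT FK_OrderLines_Orders;

BEGIN TRANSACTION;

-- Rewrite parent and child keys from the same mapping so they stay consistent
UPDATE o SET o.OrderID = m.NewID
FROM dbo.Orders AS o
JOIN dbo.GuidMap AS m ON m.OldID = o.OrderID;

UPDATE l SET l.OrderID = m.NewID
FROM dbo.OrderLines AS l
JOIN dbo.GuidMap AS m ON m.OldID = l.OrderID;

COMMIT;

-- Reapply the foreign key
ALTER TABLE dbo.OrderLines
    ADD CONSTRAINT FK_OrderLines_Orders
    FOREIGN KEY (OrderID) REFERENCES dbo.Orders (OrderID);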
It depends whether the tables are clustered on the primary index or on another index. For instance, if you are creating large amounts of new records in a table with a GUID PK and a creation date, it usually makes sense to cluster by the creation date in order to optimize the insert operation.
On the other hand, depending on the queries done, a cluster on the GUID may be better, in which case using sequential GUIDs can help with the insert performance. I'd say that it isn't possible to give a final answer to your question without in-depth knowledge of the usage.
I'm facing a similar issue. I think it would be possible to update existing data by writing an application to update your existing keys using the NHibernate guid.comb algorithm. To propagate the new keys to the related foreign key tables, maybe it would be possible to temporarily cascade updates? Doing this through .NET code would be slower than a SQL script; another option might be to duplicate the guid.comb logic in SQL, but I'm not sure if this is possible.
If you choose to retain the existing data, using the guid.comb algorithm should have some performance improvement, there will still be page splitting when inserts occur but because new guids are sequential instead of totally random this will be at least somewhat reduced. Another option to consider would be to remove the clustered index on your GUID primary key, although I'm not sure how much existing query performance will be impacted.
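If you do want to generate comb-style values in T-SQL, one widely circulated approximation (the exact byte layout here is an assumption and not necessarily identical to NHibernate's guid.comb) replaces the last six bytes of a random guid with bytes from the current time, which is the part SQL Server weighs most heavily when sorting uniqueidentifiers:

-- Sketch only: keep the first 10 random bytes of NEWID() and append 6 bytes derived from the current datetime
SELECT CAST(
           CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6))
       AS uniqueidentifier) AS CombGuid;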

When having an identity column is not a good idea?

In tables where you need only one column as the key, and values in that column can be integers, when shouldn't you use an identity field?
Conversely, for the same table and column, when would you generate its values manually rather than using an autogenerated value for each record?
I guess that it would be the case when there are lots of inserts and deletes to the table. Am I right? What other situations could be?
If you have already settled on the surrogate side of the Great Primary Key Debacle, then I can't find a single reason not to use identity keys. The usual alternatives are guids (they have many disadvantages, primarily size and randomness) and application-layer generated keys. But creating a surrogate key in the application layer is a little bit harder than it seems, and it also does not cover non-application data access (i.e. batch loads, imports, other apps, etc.). The one special case is distributed applications, where guids and even sequential guids may offer a better alternative to site id + identity keys.
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
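For example, a hypothetical junction table can use the composite of its two foreign keys as the primary key, with no identity column at all:

-- The composite of the two foreign keys serves as the primary key
CREATE TABLE dbo.StudentCourse
(
    StudentID int NOT NULL REFERENCES dbo.Students (StudentID),
    CourseID  int NOT NULL REFERENCES dbo.Courses  (CourseID),
    CONSTRAINT PK_StudentCourse PRIMARY KEY (StudentID, CourseID)
);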
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent duplicate keys from colliding.
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
One case of not wanting an identity field would be in a one to one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
You cannot (normally) specify values when inserting into identity columns, so for example if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
In order to perform this sort of insert you need to turn IDENTITY_INSERT on for that table - it is not intended to be on normally, and it can only be on for one table at a time in a given session.
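For example, with the hypothetical MyTable above, the insert only works while IDENTITY_INSERT is switched on:

SET IDENTITY_INSERT MyTable ON;

INSERT INTO MyTable (id, name) VALUES (1, 'Smith');  -- succeeds while IDENTITY_INSERT is on

SET IDENTITY_INSERT MyTable OFF;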
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception, which is snapshot configuration tables which I am tracking with an audit trigger. In this case, there is usually a logical "primary key" (usually date of the snapshot and natural key of the row - like a cost center or gl account number for which the row is a configuration record), but instead of using the natural "primary key" as the primary key, I add an IDENTITY and make that the primary key and make a unique index or constraint on the date and natural key. Although theoretically the date and natural key shouldn't change, in these tables, if a user does that instead of adding a new row and deleting the old row, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of a key and the appearance of a new one.
I recently implemented a Suffix Trie in C# that could index novels, and then allow searches to be done extremely fast, linear to the size of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL, and needed a structure to represent a Node in a table.
I ended up with the following structure: NodeID, Character, ParentID, etc., where NodeID was the primary key.
I didn't want this to be done as an autoincrementing identity for two main reasons.
How do I get the value of a NodeID after I add it to the database/data table?
I wanted more control when it came to generating my own IDs.
