Switching to sequential (comb) guids - what about existing data? - sql-server

We have a database with 500+ tables, in which almost all the tables have a clustered PK that is of datatype guid (uniqueidentifier).
We are in the process of testing a switch from "normal" "random" guids generated through .NET's Guid.NewGuid() method to sequential guids generated through the NHibernate guid.comb algorithm. This seems to be working well, but what about clients that already have millions of rows with "random" primary key values?
Will they benefit from the fact that new ids generated from now on will be sequential?
Could/should anything be done to their existing data?
Thanks in advance for any pointers on this.

You could do this, but I'm not sure you would want to. I don't see any benefit in using sequential guids; in fact, guids are not recommended as primary keys at all unless there are distributed/replication reasons involved. Are you using a clustered index?
Having said that, if you go ahead, I recommend loading a table with values from your algorithm first.
You are going to have hassles with foreign keys. You will need to associate the old and new guids in the aforementioned table, drop the foreign keys, perform a transactional update, then reapply the foreign keys.
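As a rough T-SQL sketch of that procedure (all table, column and constraint names here are hypothetical placeholders, with a single Orders/OrderLines pair standing in for the real schema):

    -- 1. Build the old-to-new mapping up front, with new ids from the comb algorithm.
    CREATE TABLE dbo.IdMap
    (
        OldId uniqueidentifier NOT NULL PRIMARY KEY,
        NewId uniqueidentifier NOT NULL UNIQUE
    );

    -- 2. Drop the foreign key so both sides can be rewritten.
    ALTER TABLE dbo.OrderLines DROP CONSTRAINT FK_OrderLines_Orders;

    -- 3. Rewrite parent and child keys in one transaction.
    BEGIN TRANSACTION;

    UPDATE o SET o.OrderId = m.NewId
    FROM dbo.Orders o
    JOIN dbo.IdMap m ON m.OldId = o.OrderId;

    UPDATE l SET l.OrderId = m.NewId
    FROM dbo.OrderLines l
    JOIN dbo.IdMap m ON m.OldId = l.OrderId;

    COMMIT TRANSACTION;

    -- 4. Reapply the foreign key.
    ALTER TABLE dbo.OrderLines
        ADD CONSTRAINT FK_OrderLines_Orders
        FOREIGN KEY (OrderId) REFERENCES dbo.Orders (OrderId);

Repeat per table and relationship; with 500+ tables you would generate these statements rather than write them by hand.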
I don't think it is worth the hassle unless you were moving away from guids altogether to, say, an integer-based system.

It depends whether the tables are clustered on the primary index or on another index. For instance, if you are creating large amounts of new records in a table with a GUID PK and a creation date, it usually makes sense to cluster by the creation date in order to optimize the insert operation.
On the other hand, depending on the queries done, a cluster on the GUID may be better, in which case using sequential GUIDs can help with the insert performance. I'd say that it isn't possible to give a final answer to your question without in-depth knowledge of the usage.

I'm facing a similar issue. I think it would be possible to update existing data by writing an application that rewrites your existing keys using the NHibernate guid.comb algorithm. To propagate the new keys to related foreign key tables, maybe it would be possible to temporarily cascade updates? Doing this through .NET code would be slower than a SQL script; another option might be to duplicate the guid.comb logic in SQL, but I'm not sure if this is possible.
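If you do want to generate comb-style values on the SQL side, the following is a commonly used approximation - a sketch only, not byte-for-byte identical to NHibernate's guid.comb - which replaces the last six bytes of a random guid with time-derived bytes (the portion SQL Server compares first when sorting uniqueidentifiers):

    -- Approximation of a comb guid in T-SQL; treat as a sketch, not NHibernate's exact algorithm.
    SELECT CAST(
             CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6))
           AS uniqueidentifier) AS CombGuid;

    -- If generating the value on the server is acceptable, NEWSEQUENTIALID() as a column
    -- default is the built-in alternative (it only works as a DEFAULT constraint):
    -- ALTER TABLE dbo.MyTable ADD CONSTRAINT DF_MyTable_Id DEFAULT NEWSEQUENTIALID() FOR Id;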
If you choose to retain the existing data, using the guid.comb algorithm should still give some performance improvement. There will still be page splitting when inserts occur, but because new guids are sequential instead of totally random, this will be at least somewhat reduced. Another option to consider would be to remove the clustered index on your GUID primary key, although I'm not sure how much existing query performance would be impacted.

Related

Use of non-clustered index on guid type column in SQL Server

I would like optimize the performance of a database that my team is using for an application.
I have been looking for areas to add foreign keys, and in turn index those columns to improve the performance of joins. However, many of our tables are joined on an id that is a GUID type, generated upon insertion of an item, and the data associated with that item in other tables generally has a column item_id containing the GUID.
I have read that adding clustered indexes to GUID type columns is a very bad decision because the index will need to be constantly reconstructed in order to be effective. However, I was wondering, is there any detriment to utilizing a non-clustered index in the scenario described above? Or is it reasonable to assume that it would help performance? I can provide more information if needed.
An index on a <anytype> is by far the best option you have to improve joins and singleton lookups. Lacking this index, the query will always have to scan the entire table end to end, with (often) abysmal performance results and concurrency gone down the drain.
It is true that uniqueidentifier makes a poor choice for indexes for the reasons you mention, but by no means does that imply that you should not create these indexes. Changing the data type to INT or BIGINT would be advisable, if possible. Using NEWSEQUENTIALID() or UuidCreateSequential to generate them would help with fragmentation issues. If all alternatives fail, you may have to do index maintenance (rebuild, reorganize) operations more often than for other indexes. But by no means do any of these drawbacks outweigh the benefit of having the index in the first place!
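As a minimal sketch of that advice, with made-up table and index names built around the item_id column from the question:

    -- Non-clustered index on the GUID foreign key column (names are illustrative).
    CREATE NONCLUSTERED INDEX IX_OrderItems_item_id
        ON dbo.OrderItems (item_id);

    -- Expect to reorganize/rebuild it more often than an index on a sequential key.
    ALTER INDEX IX_OrderItems_item_id ON dbo.OrderItems REORGANIZE;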
There are two performance concerns:
- insert
- select
An index should improve selects.
An index will slow down inserts.
If the inserts are in order, the index does not fragment.
If the inserts are not in order, the index will fragment.
Index fragmentation slows down both inserts and selects.
Maintenance can defragment the index.
Adding a non-clustered index to the column that references the FK will help the joins.
Since that column is most likely not ordered anyway, the fact that it is a GUID is no loss.
The FK table itself is where a GUID is not a good candidate for a PK (clustered index).
With a GUID as PK, that index fragments on insert.
Int or sequential ID are better candidates, as they would not fragment the PK on insert.
But it's no big deal - just defragment those tables.
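A sketch of that maintenance step, using the standard fragmentation DMV (the table name is a placeholder, and the thresholds are only the usual rule of thumb):

    -- Check fragmentation for one table's indexes.
    SELECT i.name AS index_name,
           ps.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'LIMITED') ps
    JOIN sys.indexes i
        ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

    -- Rough rule of thumb: REORGANIZE between ~5% and ~30% fragmentation, REBUILD above that.
    ALTER INDEX ALL ON dbo.MyTable REORGANIZE;
    -- ALTER INDEX ALL ON dbo.MyTable REBUILD;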
Yes, you are better off changing the Guid index from clustered to non-clustered. The Guid can still be the primary key and you don't need to change your query/source code. No reordering of data, and increased performance.
In databases like SQL Azure it is mandatory to have a clustered index, so you could use a date/datetime field for it. Creating an additional int identity/autoincrement column is unnecessary: some developers in a team tend to use those while others use the GUID, resulting in an inconsistent application. So keep only the GUID... full stop!
Talking about sequential Guids, I think Guids are better created from code than from the database. Modern DALs and repository patterns do not want dependencies on the DB for CRUD - e.g. linq queries and automated builds with unit testing without a DB dependency. And creating a sequential guid ourselves is not a good idea (at least for me). So a Guid as primary key with a non-clustered index is the best option there is.
I have backing from Microsoft on the non-clustered subject http://blogs.msdn.com/b/sqlazure/archive/2010/05/05/10007304.aspx
Edited: Backing is gone ("No Resource Found")
It would usually help performance. But you may wish to create the index with a fill factor of less than 100% so that the inevitable page splits don't have to happen quite so often. Regular maintenance on the index would certainly be a plus.
Yes, a non-clustered index would be ideal for your situation. The underlying structure is a B-tree, like the clustered index, but the underlying table data is not sorted on it, so the problems with the non-sequential nature of the GUID do not exist. The NC index exists separately from the table.
Be careful not to add too many non-clustered indices, though. Optimize only where you need to. Run the profiler to see which queries are taking a long time, and optimize only those. Additionally, be sure to set the fill factor to a value <50% unless the database rarely gets any updates, or space is a constraint.
Relevant MSDN: http://msdn.microsoft.com/en-us/library/ms177484(v=sql.105).aspx
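The fill factor suggestion from the answers above might look like this; the 80% value and the names are purely illustrative, and the right number depends on insert volume between maintenance windows:

    -- Leave free space on pages so random GUID inserts cause fewer page splits.
    CREATE NONCLUSTERED INDEX IX_Items_ItemGuid
        ON dbo.Items (ItemGuid)
        WITH (FILLFACTOR = 80);

    -- The fill factor is only reapplied when the index is rebuilt.
    ALTER INDEX IX_Items_ItemGuid ON dbo.Items REBUILD WITH (FILLFACTOR = 80);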

Any special considerations when adding lots of missing foreign key indexes in an existing database?

I'm working on a system backed up by a rather large SQL Server 2008 database, about 250 tables. After doing some performance optimization, I discovered that in many of the tables there were missing indexes on the foreign keys. After going through all the tables in more details, I identified about 150 foreign keys with potential missing indexes.
I know that it is generally good practice to put an index on every foreign key, and I also know that indexes aren't automatically created for foreign keys. I suspect that people not thinking about (or being aware of...) the latter is the reason why the database has ended up in its current condition.
But since I'm not even close to an expert on database optimization, I thought I'd fire away a control question before starting to add all those indexes:
Question:
Are there any special considerations to keep in mind when adding such a large number of indexes to an existing database?
The only thing I can think of myself is that every index carries a potential performance penalty for inserts and updates. But for indexes strictly on foreign key columns, I imagine that's not a major problem. Also, our database isn't really insert/update intensive, which is another reason this shouldn't be an issue.
I could perhaps also mention that we use NHibernate as our ORM layer, which I guess is yet another reason for having good indexes on foreign keys (since we're accessing lots of objects through foreign key properties).
I don't think there's anything special apart from the obvious points to consider:
More disk space required for the indexes
Performance impact, but this is something that should be measured rather than guessed (even a database that is insert/update intensive usually has more reads than writes; the engine has to find a row before it can be updated)
Extend your DDL generation and management process to ensure that all FK indexes are scripted by default from now on
If the constraints already exist and only the indexes are missing, then you don't have to worry about the biggest potential issue, which is invalid data that you need to review and fix. But I understand from your question that the FKs themselves are already there.
New plans (whilst hopefully faster due to the new indexes) might offer new opportunities for deadlocks to occur - where previously, all of the activity for a statement/transaction was only against the clustered index, it can now fulfil part of the query using a lock against a new index instead.
Just make sure you test with realistic data volumes and workloads.
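Before scripting the new indexes, something like the following can list the candidates; it is a rough sketch that assumes single-column foreign keys and only checks whether the column is the leading key of any existing index:

    -- Single-column foreign keys whose column does not lead any index.
    SELECT fk.name AS foreign_key_name,
           t.name  AS table_name,
           c.name  AS column_name
    FROM sys.foreign_keys fk
    JOIN sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
    JOIN sys.tables t  ON t.object_id = fk.parent_object_id
    JOIN sys.columns c ON c.object_id = fkc.parent_object_id
                      AND c.column_id = fkc.parent_column_id
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM sys.index_columns ic
        WHERE ic.object_id   = fkc.parent_object_id
          AND ic.column_id   = fkc.parent_column_id
          AND ic.key_ordinal = 1
    );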

what type of database record id to use: long or guid?

In recent years I have been using MSSQL databases, and all unique records in the tables have an ID column of type bigint (long). It is auto-incrementing and generally works fine.
Currently I am observing that people prefer to use GUIDs for a record's identity.
Does it make sense to swap bigint to guid for unique record id?
I think it doesn't make sense, as generating a bigint as well as sorting it will always be faster than a guid, but... trouble comes when using two (or more) separate instances of the application and database and keeping them in sync, since you have to manage id pools between the SQL servers (for example: sql1 uses ids from 100 to 200, sql2 uses ids from 201 to 300) - this is thin ice.
With guid id, you don't care about id pools.
What is your advice for my mirrored application (and db): stay with traditional ID's or move to GUIDs?
Thanks in advance for your reply!
Guids have the following advantages:
Being able to create them offline from the database without worrying about collisions.
You're never going to run out of them
Disadvantages:
Sequential inserts can perform poorly (especially on clustered indexes).
Sequential Guids fix this
Take up more space per row
Creating one cleanly isn't cheap
(but if the clients are generating them, this is actually no problem)
The column should still have a unique constraint (either as the PK or as a separate constraint if it is part of some other relationship) since there is nothing stopping someone supplying the GUID by hand and accidentally/deliberately violating uniqueness.
If the space doesn't bother you and your performance is not significantly impacted, they make a lot of problems go away. The decision is inevitably specific to the individual needs of the application.
I use GUIDs in any scenario that involves either replication or client-side ID generation. It's just so much easier to manage identity in either of those situations with a GUID.
For two-tier scenarios like a web application talking directly to the database, or for servers that don't need to replicate (or perhaps only need to replicate in one direction, pub/sub style) then I think an auto-incrementing ID column is just fine.
As for whether to stay with autoincs or move to GUIDs ... it's one thing to advocate GUIDs in a green-field application where you can make all these decisions up front. It's another to advise someone to migrate an existing database. It might be more pain than it's worth.
GUIDs have issues with performance and concurrency when page splits occur. INTs can run a page fill of 100% - rows are only ever added at one end - whereas GUIDs insert all over the index, so you probably have to run a lower fill factor, which wastes space throughout the index.
GUIDs can be allocated in the application, so the app can know the ID of the record it is about to create, which can be handy; but, technically, it is possible for duplicate GUIDs to be generated (long odds, but at least put a unique index on GUID columns).
I agree that for merging databases it's easier. But for me a straight INT is better; then live with the hassle of sorting out how to merge DBs when/if it is actually needed.
If your data moves around often, then a GUID is the best choice for the key of the table.
If you really care about performance, just stick to int or bigint.
If you want to leverage both of the above, use int or bigint as the key of the table, and give each row a rowguid column so that the data can also be moved around easily without losing integrity.
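A minimal sketch of that third option (names invented for illustration): the bigint identity keeps the clustered key narrow and sequential, while the ROWGUIDCOL column gives each row a portable identity:

    CREATE TABLE dbo.Customer
    (
        CustomerId bigint IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
        RowGuid    uniqueidentifier NOT NULL ROWGUIDCOL
            CONSTRAINT DF_Customer_RowGuid DEFAULT NEWSEQUENTIALID()
            CONSTRAINT UQ_Customer_RowGuid UNIQUE,
        Name       nvarchar(200) NOT NULL
    );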
If the ids are going to be displayed in the querystring, use Guids, otherwise use long as a rule.

Is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?

For SQL server is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?
That depends on what you're doing:
If speed is the primary concern then a plain old int is probably big enough.
If you really will have more than 2 billion (with a B ;) ) records, then use bigint or a sequential guid.
If you need to be able to easily synchronize with records created remotely, then Guid is really great.
Update
Some additional (less-obvious) notes on Guids:
They can be hard on indexes, and that cuts to the core of database performance
You can use sequential guids to get back some of the indexing performance, but you give up some of the randomness that the security point below relies on.
Guids can be hard to debug by hand (where id='xxx-xxx-xxxxx'), but you get some of that back via sequential guids as well (where id='xxx-xxx' + '123').
For the same reason, Guids can make ID-based security attacks more difficult- but not impossible. (You can't just type 'http://example.com?userid=xxxx' and expect to get a result for someone else's account).
In general I'd recommend a BIGINT over a GUID (as guids are big and slow), but the question is, do you even need that? (I.e. are you doing replication?)
If you're expecting less than 2 billion rows, the traditional INT will be fine.
If you are doing replication, or you have sales people who run disconnected databases that need to merge, use a GUID. Otherwise I'd go for an int or bigint. They are far easier to deal with in the long run.
Depends on what you need. DB performance would gain from an integer, while GUIDs are useful for replication and for not needing to hear back from the DB what identity has been created, i.e. the code can create the GUID identity before inserting the row.
If you're planning on using merge replication then a ROWGUIDCOL is beneficial to performance (see here for info). Otherwise we need more info about what your definition of 'better' is; better for what?
Unless you have a real need for a GUID, such as being able to generate keys anywhere and not just on the server, then I would stick with using INTEGER-based keys. GUIDs are expensive to create and make it harder to actually look at the data. Plus, have you ever tried to type a GUID in an SQL query? It's painful!
There can be a few more aspects or requirements that favour a GUID.
If the primary key is of any numeric type (int, bigint or any other), then either you need to make it an identity column, or you need to check the last saved value in the table.
And in that case, if the record in the foreign key table is saved as part of the same transaction, it is awkward to get the last identity value for the primary key. If IDENT_CURRENT is used, it again affects performance while saving the record into the foreign key table.
So when saving records like that in a transaction, it is convenient to generate the Guid for the primary key first, and then save the generated key (Guid) in both the primary and foreign key table(s).
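A sketch of that pattern with made-up table names: the key is generated first (here with NEWID(); the application could equally supply a comb guid) and reused for the child rows inside the same transaction, so no identity value ever has to be read back:

    DECLARE @OrderId uniqueidentifier = NEWID();

    BEGIN TRANSACTION;

    INSERT INTO dbo.Orders (OrderId, CreatedOn)
    VALUES (@OrderId, GETUTCDATE());

    INSERT INTO dbo.OrderLines (OrderLineId, OrderId, Quantity)
    VALUES (NEWID(), @OrderId, 3);

    COMMIT TRANSACTION;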
It really depends on whether or not the information coming in is somehow sequential. For things such as users, a GUID might well be the better choice. But for sequential data, such as orders or other things that need to be easily sortable, a bigint may well be a better solution, as it will be indexed and provide fast sorting without the cost of another index.
It really depends whether you're expecting to have replication in the picture. Replication requires a row UUID, so if you're planning on doing that you may as well do it up front.
I'm with Andrew Rollings.
Now you could argue space efficiency. An int is what, 8 bytes max? A guid is going to be much longer.
But I have two main reasons for preference: readability and access time. Numbers are easier for me than GUIDs (since I can always find the next/previous record easily).
As for access time, note that some DBs can start to have BIG problems with GUIDs. I know this is the case with MySQL (MySQL InnoDB Primary Key Choice: GUID/UUID vs Integer Insert Performance). This may not be much of a problem with SQL Server, but it's something to watch out for.
I'd say stick with INT or BIGINT. The only time I would think you'd want the GUID is when you are going to give them out and don't want people to be able to guess the IDs of other records for security reasons.

Tables with no Primary Key

I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
@Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is no true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high concurrency contention. Basically, what happens is that most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require an exclusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID(), on the other hand, does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isn't too bad, the effects do add up quickly. With IDENTITY, the page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using DBCC DBREINDEX. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question, then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one that's there.
Yet another huge benefit of using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You don't have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
documents the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, the two concepts are separate and should be considered separately. The CIX indicates the way the data is stored and referred to by NCIXs, whereas the PK provides uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
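To make the separation concrete, a hedged sketch with a hypothetical table: the PK enforces uniqueness (non-clustered, on the client-generated GUID) while the clustered index is declared independently on a date column:

    CREATE TABLE dbo.Event
    (
        EventId   uniqueidentifier NOT NULL,
        CreatedOn datetime2 NOT NULL,
        Payload   nvarchar(max) NULL,
        CONSTRAINT PK_Event PRIMARY KEY NONCLUSTERED (EventId)
    );

    CREATE CLUSTERED INDEX CIX_Event_CreatedOn ON dbo.Event (CreatedOn);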
Nobody has answered the actual question: what are the pluses/minuses of a table with NO PK NOR a clustered index?
In my opinion, if you optimize for faster inserts (especially incremental bulk inserts, e.g. when you bulk load data into a non-empty table), such a table - with NO clustered index, NO constraints, NO foreign keys, NO defaults and NO primary key, in a database with the Simple recovery model - is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety), you may want to add non-clustered, non-unique indexes as needed, but keep them to a minimum.
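A minimal illustration of that trade-off, with invented names: load into a bare heap, then add only the non-clustered index the queries actually need:

    -- Bare heap for fast bulk loads: no PK, no clustered index, no constraints, no defaults.
    CREATE TABLE dbo.StagingReadings
    (
        ReadingId uniqueidentifier NOT NULL,
        SensorId  int NOT NULL,
        ReadingAt datetime2 NOT NULL,
        Value     float NOT NULL
    );

    -- ... bulk load here (BULK INSERT / bcp / INSERT ... SELECT) ...

    -- Added after loading, and only if the table is queried rather than scanned.
    CREATE NONCLUSTERED INDEX IX_StagingReadings_SensorId
        ON dbo.StagingReadings (SensorId, ReadingAt);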
I too have always heard that having an auto-incrementing int is good for performance, even if you don't actually use it.
A Primary Key needn't be an auto-incrementing field; in many cases that just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, you are correct that identities are something to steer clear of. I would make your GUID a primary key but non-clustered, since you can't use newsequentialid. That strikes me as your best course. If you don't make it a PK but just put a unique index on it, sooner or later that may cause people who maintain the system to not understand the FK relationships properly, introducing bugs.
