Unique identifier as primary key when syncing databases - sql-server

I have spent the best part of a day researching and testing different methods of syncing a SQL Server database with Core Data on the Mac. I have tested both INTs and GUIDs (including sequential GUIDs) as my primary keys, and although GUIDs are by far the worst in terms of performance, I can't see any other way of ensuring uniqueness across systems.
Is using GUIDs for primary keys the wrong way to go when syncing data between platforms? I find it hard to believe that companies use GUIDs when syncing, but most articles I read on the subject seem to point to just that. If developers are using GUIDs, does anybody know how performance can be improved? I tried using a GUID as a primary key with a non-clustered index and a date field as my clustered index, with no great performance improvement.
Any help would be much appreciated, especially if you have tackled a similar problem.

GUIDs make syncing a lot easier. Sequential GUIDs will largely alleviate the fragmentation issue, leaving the 16-byte column size as the main remaining problem.
As long as you ensure you have another sequential and narrow column as your clustered key, you'll save a lot of space for your nonclustered indexes - it seems like you already know this.
Assuming you're not dealing with GBs of data, performance shouldn't be that affected by a GUID in this case, given you've already handled the GUID column with proper care.
If you only need to sync two systems, I've previously built systems where system A used IDENTITY(-1,-1) as the primary key and the other system used IDENTITY(1,1) as the primary key. That ensures easy syncing while keeping the primary key nice and narrow. It won't work for more than two systems, however.
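A minimal sketch of that two-system layout, with hypothetical table and column names (each statement runs on its own server); system A only ever produces negative keys and system B only positive ones, so the ranges can never collide:

-- System A: identity counts down from -1
CREATE TABLE dbo.Customer
(CustomerID INT IDENTITY(-1,-1) NOT NULL PRIMARY KEY,
 Name NVARCHAR(100) NOT NULL)

-- System B: identity counts up from 1
CREATE TABLE dbo.Customer
(CustomerID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
 Name NVARCHAR(100) NOT NULL)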

Agreed: when you use GUIDs you may run into huge fragmentation of the indexes that use them as the key. Other ways of ensuring uniqueness across systems are to
use identity columns and seed them to different, non-overlapping ranges:
1 - 100,000,000: server A
100,000,001 - 200,000,000: server B
etc.
or to use composite keys (identity int + location code) to distinguish the original location of the data; a sketch of both approaches follows below.
Three different rows:
1 AB
1 BZ
1 XV
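A minimal sketch of both ideas, using hypothetical table and column names (server B would simply use a different seed):

-- Seeded-range variant: plain identity key, seeded to the 1 - 100,000,000 range on server A
CREATE TABLE dbo.Orders
(OrderID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
 OrderDate DATETIME NOT NULL)
-- Server B would instead declare OrderID INT IDENTITY(100000001,1)

-- Composite-key variant: identity plus a location code identifies the originating system
CREATE TABLE dbo.OrdersComposite
(OrderID INT IDENTITY(1,1) NOT NULL,
 LocationCode CHAR(2) NOT NULL,          -- 'AB', 'BZ', 'XV', ...
 OrderDate DATETIME NOT NULL,
 CONSTRAINT PK_OrdersComposite PRIMARY KEY (OrderID, LocationCode))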
Regards
Piotr

Related

What options are there for transferring one particular webapp user's data onto another DB instance without suffering PK collision?

EDITED
When we first architected our web app years ago, we chose auto-increment int for all of our users' data. However, we are now getting burned by how hard it is to transfer a specific user's data (multiple tables with one-to-many relationships) to another non-empty database instance (with the same table structures).
While SET IDENTITY_INSERT table ON | OFF may work for some tables, with our current architecture we will still run into problems because certain 'many' rows in a 'one-to-many' relation may collide with rows already in the destination DB.
Inspired by Pam Lahoud's answer below, I started researching non-clustered PKs and PK alternatives. Then I came across Selecting an Appropriate Primary Key for a Distributed Environment from MSDN, and "Keys That Include a Node Identifier" caught my eye. Does anyone have experience with this kind of architecture?
GUIDs as primary keys are awesome, GUIDs as clustered index keys are not so awesome. While the PK does default to be clustered, it doesn't necessarily have to be. If there is another column on the table that would make sense to be clustered, you might consider converting over to non-clustered GUID primary keys and clustering on some other field.
If the PKs on your table are used frequently for filtering and joining, it probably still makes sense for them to be clustered even if they are GUIDs. Using newsequentialid() will get around most of the problems that are caused by GUID clustered index keys - namely logical index fragmentation, page splits and low page density. You still have the issue that GUIDs are a large data type and therefore all your indexes (both clustered and non-clustered since they also contain the clustered index key) will be somewhat larger, but I don't think that's necessarily a deal breaker.
The only other solution I can think of other than converting to GUIDs would be to specify identity ranges on each of your databases and add constraints to ensure that there is no overlap in ranges between them. This of course wouldn't work for existing data, but would prevent the problem from happening in the future as new data being inserted should be unique across your farm.
As with anything in SQL Server, there are very few "always" or "never" rules; GUIDs as primary keys and/or clustered index keys is one of those "it depends" areas. In this case, I think the GUID PK might be the right solution.
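A minimal sketch of the identity-range-plus-constraint idea mentioned above, with hypothetical names; each database gets its own seed and a CHECK constraint so overlapping values are rejected outright:

-- Database 1: identities 1 - 999,999,999
CREATE TABLE dbo.UserData
(UserDataID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
 Payload NVARCHAR(200) NOT NULL,
 CONSTRAINT CK_UserData_Range CHECK (UserDataID BETWEEN 1 AND 999999999))

-- Database 2 would use IDENTITY(1000000000,1)
-- and CHECK (UserDataID BETWEEN 1000000000 AND 1999999999)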

What are the best practices for using a GUID as a primary key, specifically regarding performance? [closed]

I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)
I'm using a GUID because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migration data between databases.
I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).
What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains from breaking up the previous GUID-based primary/clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT, with its range of 2+ billion values, should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
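A rough back-of-the-envelope reading of where those figures come from, counting only the key columns: 4 bytes x 1,000,000 rows is roughly 3.8 MB, while 16 bytes x 1,000,000 rows is roughly 15.26 MB for the clustering key itself; and since each of the 6 nonclustered indexes carries a copy of the clustering key in every row, that per-table cost repeats six times over (roughly 22.9 MB vs. 91.5 MB).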
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.
Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:
CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
 MyINT INT IDENTITY(1,1) NOT NULL,
 .... add more columns as needed ......)

ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID)

CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT)
Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED.
This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!
I've been using GUIDs as PKs since 2005. In this distributed-database world, it is absolutely the best way to merge distributed data. You can fire and forget merge tables without all the worry of ints matching across joined tables. GUID joins can be copied without any worry.
This is my setup for using GUIDs:
PK = GUID. GUIDs are indexed similarly to strings, so tables with high row counts (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.
PK Guid is a NON-clustered index. Never put a clustered index on a GUID unless it is a NewSequentialID - but even then, a server reboot will cause major breaks in the ordering.
Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.
Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.
Here is my Email table...
CREATE TABLE [Core].[Email] (
[EmailID] UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (newsequentialid()) NOT NULL,
[EmailAddress] NVARCHAR (50) CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,
[CreatedDate] DATETIME CONSTRAINT [DF_Email_CreatedDate] DEFAULT (getutcdate()) NOT NULL,
[ClusterID] INT NOT NULL IDENTITY,
CONSTRAINT [PK_Email] PRIMARY KEY NONCLUSTERED ([EmailID] ASC)
);
GO
CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID])
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] Asc)
I am currently developing a web application with EF Core, and here is the pattern I use:
All my classes (tables) have an int PK and FK.
I then have an additional column of type Guid (generated by the C# constructor) with a non-clustered index on it.
All the joins of tables within EF are managed through the int keys while all the access from outside (controllers) are done with the Guids.
This solution avoids exposing the int keys in URLs while keeping the model tidy and fast.
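A minimal sketch of the schema this pattern maps to on the SQL Server side (hypothetical table and column names; EF Core would generate something equivalent from the model):

CREATE TABLE dbo.Product
(ProductID INT IDENTITY(1,1) NOT NULL,          -- internal key, used for all joins and FKs
 PublicID UNIQUEIDENTIFIER NOT NULL,            -- generated in C#, exposed in URLs
 Name NVARCHAR(100) NOT NULL,
 CONSTRAINT PK_Product PRIMARY KEY CLUSTERED (ProductID))

CREATE UNIQUE NONCLUSTERED INDEX IX_Product_PublicID ON dbo.Product(PublicID)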
This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key unless I have a specific need not to, and I also let SQL Server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play, including but not limited to expected DB size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option, and you should choose what is more appropriate for your app and what allows you to develop more easily, quickly, and effectively. (If you never complete the app, what difference does the rest make? :)
https://web.archive.org/web/20120812080710/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.
Well, if your data never reaches millions of rows, you are good. If you ask me, I never use a GUID as a database identity column of any type, including the PK, even if you force me to design with a shotgun to my head.
Using a GUID as the primary key is a definitive scaling stopper, and a critical one.
I recommend you check the database identity and sequence options. A sequence is table-independent and may provide a solution for your needs (MS SQL has sequences).
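A minimal sketch of the sequence option, with hypothetical names; because a sequence is its own object, each server or shard can be given its own non-overlapping starting value:

-- One sequence per server, started in its own range
CREATE SEQUENCE dbo.OrderIDSeq AS BIGINT
    START WITH 1            -- a second server might start with 1000000000
    INCREMENT BY 1

CREATE TABLE dbo.OrderHeader
(OrderID BIGINT NOT NULL
     CONSTRAINT DF_OrderHeader_OrderID DEFAULT (NEXT VALUE FOR dbo.OrderIDSeq)
     PRIMARY KEY,
 OrderDate DATETIME2 NOT NULL)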
Once your tables reach some dozens of millions of rows, e.g. 50 million, you will not be able to read/write information at acceptable speeds, and even standard database index maintenance becomes impossible.
Then you need to use partitioning to stay scalable up to half a billion or even 1-2 billion rows. Adding partitioning along the way is not the easiest thing: all read/write statements must include the partition column (full app changes!).
These numbers (50 million and 500 million) are, of course, for light select usage. If you need to select information in complex ways and/or have lots of inserts/updates/deletes, for a very demanding system the thresholds could even be 1-2 million and 50 million instead. If you also add factors like the full recovery model, high availability, and no maintenance window, all common for modern systems, things become extremely ugly.
Note at this point that 2 billion, the INT limit, may look bad, but INT is 4 times smaller than a GUID and is a sequential type of data; small size and sequential ordering are the #1 factors for database scalability. You can also use BIGINT, which is still half the size of a GUID and still sequential; being sequential is what is really, deadly important - even more important than size - when you get to many millions or a few billion rows.
If the GUID is also clustered, things are much worse: just inserting a new row means it is actually stored at a random physical position.
Even as just a column - not the PK or part of the PK - merely indexing it is trouble from a fragmentation perspective.
Having a GUID column is perfectly OK, like any varchar column, as long as you do not use it as part of the PK or, in general, as a key column for joining tables. Your database should have its own PK elements and filter and join data using them; additionally filtering by a GUID afterwards is perfectly OK.
Having sequential IDs makes it a LOT easier for a hacker or data miner to compromise your site and data. Keep that in mind when choosing a PK for a website.
If you use a GUID as the primary key and create a clustered index on it, then I suggest using NEWSEQUENTIALID() as its default value.
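A minimal sketch of that default, reusing the PKGUID/MyTable names from the earlier example in this thread; note that NEWSEQUENTIALID() can only be used in a DEFAULT constraint, not called directly in an INSERT:

ALTER TABLE dbo.MyTable
ADD CONSTRAINT DF_MyTable_PKGUID DEFAULT NEWSEQUENTIALID() FOR PKGUID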
Another reason not to expose an Id in the user interface is that a competitor can see your Id incrementing over a day or other period and so deduce the volume of business you are doing.
Most of the time it should not be used as the primary key for a table because it really hurts the performance of the database.
Useful links regarding GUID impact on performance and use as a primary key:
https://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/

SQL Server performance difference with single or multi column primary key?

Is there any difference in performance (in terms of inserting/updating & querying) a table if the primary key is a single column (e.g., a GUID generated for every row) or multiple columns (e.g., a foreign key GUID + an offset number)?
I would assume querying speeds should be quicker, if anything, with multi-column primary keys; however, I would imagine inserting would be slower due to a slightly more complicated uniqueness check. I also imagine the data types of a multi-column primary key could matter (e.g., if one of the columns was a DateTime type it would add complexity). These are just my thoughts to provoke answers and discussion (hopefully!) and are not fact-based.
I realise there are some other questions covering this topic, but I'm wondering about performance impacts rather than management/business concerns.
You will be affected more by each component of the key being (a) variable-length and (b) wide (wide instead of narrow columns) than by the number of components in the key. Unless MS have broken it again in the latest release (they broke heaps in 2005). The datatype does not slow it down; the width, and particularly variable length (of any datatype), does. Note that a fixed-length column is made variable if it is set to nullable. Variable-length columns in indices are bad news, because a bit of "unpacking" has to be performed on every access to get at the data.
Obviously, keep indexed columns as narrow as possible, using only fixed-length, non-nullable columns.
In terms of the number of columns in a compound key, sure, one column is faster than seven, but not by that much: three fat, wide, variable-length columns are much slower than seven thin, fixed-length columns.
GUID is of course a very fat key; GUID plus anything else is very, very fat; a nullable GUID is Guinness material. Unfortunately it is the knee-jerk reaction to solving the IDENTITY problem, which in turn is a consequence of not having chosen good natural relational keys. So you are best advised to fix the real problem at the source and choose good natural keys; avoid IDENTITY; avoid GUIDs.
Experience and performance tuning, not conjecture.
It depends on your access patterns, read/write ratio and whether (possibly most importantly) the clustered index is defined on the Primary Key.
Rule of thumb is make your primary key as small as possible (32 bit int) and define the clustered index on a monotonically increasing key (think IDENTITY) where possible, unless you have range searches that form a large proportion of the queries against that table.
If your application is write-intensive and you define the clustered index on the GUID column, you should note:
All non-clustered indexes will contain the clustered index key and will therefore be larger. This may have a negative effect on performance if there are many NC indexes.
Unless you are using an 'ordered' GUID (such as a COMB or NEWSEQUENTIALID()), your inserts will fragment the index over time. This means you need regular index rebuilds and possibly an increase in the amount of free space left in pages (fill factor); see the sketch after this list.
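A minimal sketch of that maintenance step, with hypothetical index and table names; rebuilding the clustered index with a lower fill factor leaves room in each page for the randomly placed inserts:

ALTER INDEX PK_MyGuidTable ON dbo.MyGuidTable
REBUILD WITH (FILLFACTOR = 80)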
Because there are many factors at work (hardware, access patterns, data size), I suggest you run some tests and benchmark your particular circumstances.
It depends on the indexing and storage in each case. All other things being equal, the choice of primary key is irrelevant as far as performance is concerned. The choice of indexes and other storage options would be the deciding factor.
If your situation is going to be geared towards a higher number of inserts, then the smaller the footprint, the better.
There are two things you need to separate, the concept of the primary key at the database level, and the concept of the key your application uses.
Why do you need a GUID? Are you going to be inserting into multiple database servers and then combining the information into one centralized database?
If that is the case, then my recommendation is an identity followed by a GUID: a clustered index on the identity, and a unique non-clustered index on the GUID. If you use the GUID as the clustered index, your data inserts will be all over the place - meaning your data will not be inserted sequentially, and this causes performance problems as your system inserts and moves pages around randomly.
Having your data inserted nicely in an ordered fashion, thanks to the identity, is the way to go. You can leave the sorting to the index structure (the non-clustered unique index containing the GUID), which is a much more efficient structure to sort than the table data.

How common an anti-pattern is representing GUID primary keys using character data?

I was pondering GUIDs as primary keys recently, and was reminded of the most egregious misuse of them I've ever encountered:
This database contained a lot of Entity-Detail parent-child relationships, like Receipt, which had LineItems. Most of the Detail tables (LineItem in this case) used GUID primary keys. But instead of being stored using MSSQL's uniqueidentifier type, they were stored as 38-character strings, in the form '{00000000-0000-0000-0000-000000000000}'. Oh, and they were almost always in nvarchar (Unicode) columns, clocking in at 76 bytes apiece (instead of 16 bytes for a uniqueidentifier).
And how often were these fields joined on? In almost every single query in the system. Hundreds of client databases, millions of records fitting this profile. Bad.
The system did not, to the best of my memory, precede SQL Server 7.0, when the uniqueidentifier was introduced. It was just a sheer failure of knowledge / research that led to this problem.
I have two questions:
How common, in your experience, is this anti-pattern?
It seems obvious that a join on a 76-byte Unicode string would be dramatically slower than a join on a 16-byte binary number, with indexes or without. But can anyone provide an idea of just what a performance hit this might entail? Assume you index the join columns in either scenario.
I think the problem is not so much the inherent speed difference between joining on 76-byte keys and 16-byte keys but more about:
How many rows you can pack into each 8 KB page (fewer rows per page means more page splits, more fragmented indexes, and worse performance).
Also - you didn't mention whether those pretend GUIDs were sequential or not. If they were part of the primary key and that key was clustered, then every insert could potentially have reorganised the table's b-tree.
Also, any non-clustered indexes you have on the table contain the primary key (so they can do lookups for queries not 100% satisfied by the non-clustered index). So your non-clustered indexes are going to be much, much bigger than if they were on a table with a UNIQUEIDENTIFIER type.
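A rough way to quantify the page-density and fragmentation side of this on an existing table is to query sys.dm_db_index_physical_stats; the table name below is just the LineItem example from the question and would be swapped for your own:

SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.page_count,
       ps.avg_page_space_used_in_percent,   -- page density
       ps.avg_fragmentation_in_percent      -- logical fragmentation
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.LineItem'), NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id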
I haven't seen GUIDs modelled as strings in any company I've worked for, but I have seen a few tables where the PK was clustered and a GUID was chosen for no particular reason. It worked fine for small datasets, and then... performance problems in production.

Is it a good idea to use rowguid as unique key in database design?

SQL Server provides the type [rowguid]. I like to use this as a unique primary key, to identify a row for update. The benefit shows up if you dump the table and reload it: no mess with SerialNo (identity) columns.
In the special case of distributed databases like offline copies on notebooks or something like that, nothing else works.
What do you think? Too much overhead?
As a primary key in the logical sense (uniquely identifying your rows) - yes, absolutely, makes total sense.
BUT: in SQL Server, the primary key is by default also the clustering key on your table, and using a ROWGUID as the clustering key is a really really bad idea. See Kimberly Tripp's excellent GUIDs as a PRIMARY and/or the clustering key article for in-depth reasons why not to use GUIDs for clustering.
Since the GUID is by definition random, you'll have a horrible index fragmentation and thus really really bad performance on insert, update, delete and select statements.
Also, since the clustering key is added to each and every entry of each and every non-clustered index on your table, you're wasting a lot of space - both on disk and in server RAM - when using a 16-byte GUID vs. a 4-byte INT.
So: yes, as a primary key, a ROWGUID has its merits - but if you do use it, definitely avoid using that column as your clustering key in the table! Use an INT IDENTITY() or something similar for that.
For a clustering key, ideally you should look for four features:
stable (never changing)
unique
as small as possible
ever-increasing
INT IDENTITY() ideally suits that need. And yes - the clustering key must be unique since it's used to physically locate a row in the table; if you pick a column that can't be guaranteed to be unique, SQL Server will actually add a four-byte uniqueifier to your clustering key - again, not something you want.
Check out The Clustered Index Debate Continues - another wonderful and insightful article by Kim Tripp (the "Queen of SQL Server Indexing") in which she explains all these requirements very nicely and thoroughly.
Marc
The problem with rowguid is that if you use it for your clustered index, you end up constantly splitting and reorganising your table pages on record inserts. A sequential GUID (NEWSEQUENTIALID()) often works better.
Our offline application is used in branch offices, and we have a central database in our main office. To synchronize the branch databases into the central database we have used a rowguid column in all tables. Maybe there are better solutions, but this is easier for us. We have not faced any major problem to date in the last 3 years.
Contrary to the accepted answer, the uniqueidentifier datatype in SQL Server is indeed a good candidate for a primary clustering key; so long as you keep it sequential.
This is easily accomplished using (newsequentialid()) as the default value for the column.
If you actually read Kimberly Tripp's article you will find that sequentially generated GUIDs are actually a good candidate for primary clustering keys in terms of fragmentation and the only downside is size.
If you have large rows with few indexes, the extra few bytes in a GUID may be negligible. Sure the issue compounds if you have short rows with numerous indexes, but this is something you have to weigh up depending on your own situation.
Using sequential uniqueidentifiers makes a lot of sense when you're going to use merge replication, especially when dealing with identity seeding and the woes that ensue.
Server-class storage isn't cheap, but I'd rather have a database that uses a bit more space than one that screeches to a halt when your automatically assigned identity ranges overlap.
