Choice of primary key type - sql-server

I have a table that potentially will have high number of inserts per second, and I'm trying to choose a type of primary key I want to use. For illustrative purposes let's say, it's users table. I am trying to chose between using GUID and BIGINT as primary key and ultimately as UserID across the app. If I use GUID, I save a trip to database to generate a new ID, but GUID is not "user-friendly" and it's not possible to partition table by this ID (which I'm planning to do). Using BIGINT is much more convenient, but generating it is a problem - I can't use IDENTITY (there is a reason fro that), so my only choice is to have some helper table that would contain last used ID and then I call this stored proc:
create proc GetNewID #ID BIGINT OUTPUT
as
begin
update HelperIDTable set #ID=id, id = id + 1
end
to get the new id. But then this helper table is an obvious bottleneck and I'm concerned with how many updates per second it can do.
I really like the idea of using BIGINT as pk, but the bottleneck problem concerns me - is there a way to roughly estimate how many id's it could produce per second? I realize it highly depends on hardware, but are there any physical limitations and what degree are we looking at? 100's/sec? 1000's/sec?
Any ideas on how to approach the problem are highly appreciated! This problem doesn't let me sleep for many night now!
Thanks!
Andrey

GUID seem to be a natural choice - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table - the single value that uniquely identifies the row in the database.
What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
So to sum it up: unless you have a really good reason, I would always recommend a INT IDENTITY field as the primary / clustered key on your table.
Marc

I try to use GUID PKs for all tables except small lookup tables. The GUID concept ensures that the identity of the object can safely be created in memeory without a roundtrip to the database and saving later without changing the identity.
When you need a "human readable" id you can use an auto increment int when saved. For partitioning you could also create the BIGINTs later by a database schedule for many users in one shot.

Do you want a primary key, for business reasons, or a clustred key, for storage concerns?
See stackoverflow.com/questions/1151625/int-vs-unique-identifier-for-id-field-in-database for a more elaborate post on the topic of PK vs. clustered key.
You really have to elaborate why can't you use IDENTITY. Generating the IDs manually, and specially on the server with an extra rountrip and an update just to generate each ID for the insert it won't scale. You'd be lucky to reach lower 100s per second. The problem is not just the rountrip and update time, but primarily from the interaction of ID generation update with insert batching: the insert batching transaction will serialize ID generation. The woraround is to separate the ID generation on separate session so it can autocommit, but then the insert batching is pointless because the ID genartion is not batched: it has to wait for log flush after each ID genrated in order to commit. Compared to this uuid will be running circles around your manual ID generation. But uuid are horrible choice for clustred key because of fragmentation.

try to hit your db with a script, perhaps with the use of jmeter to simulate concurrent hits. Perhaps you can then just measure yourself how much load you can handle. Also your DB could cause a bottle neck. Which one is it? I would prefure PostgreSQL for heavy load, like yahoo and skype also do

An idea that requires serious testing: try creating (inserting) new rows in batches -- say 1000 (10,000? 1M?) a time. You could have a master (aka bottleneck) table listing the next one to use, or you might have a query that does something like
select min(id) where (name = '')
Generate a fresh batch of emtpy rows in the morning, every hour, or whenever you're down to a certain number of free ones. This only addresses the issue of generating new IDs, but if that's the main bottleneck it might help.
A table partitioning option: Assuming a bigint ID column, how are you defining the partition? If you are allowing for 1G rows per day, you could set up the new partition in the evening (day1 = 1,000,000,000 through 1,999,999,999, day2 = 2,000,000,000 through 2,999,999,999, etc.) and then swap it in when it's ready. You are of course limited to 1000 partitions, so with bigints you'll run out of partitions before you run out of IDs.

Related

optimizing sql server database

My database has one very large table with over 2 billion rows with 3 columns.
Id(uniqueidentity), Type(int, between 0-10. 0 = most used. 10 = least used), Data(Binary data between 1-10MB)
What are some ways I can optimize this database? (primarily select queries)
*Note: I might add a few more columns to this table later (eg: location, date...)
Assuming that the id column is the clustered index key, and assuming that by uniqueidentity you mean uniqueidentifier:
do you need the uniqueidentifier type? Why?
What other alternatives have you considered?
Do you populate the data using sequential GUIDs or not?
GUIDs are a notoriously poor choise for clustered keys. See GUIDs as PRIMARY KEYs and/or the clustering key for a more detailed discussion:
But, a GUID that is not sequential -
like one that has it's values
generated in the client (using .NET)
OR generated by the newid() function
(in SQL Server) can be a horribly bad
choice - primarily because of the
fragmentation that it creates in the
base table but also because of its
size. It's unnecessarily wide (it's 4
times wider than an int-based identity
- which can give you 2 billion (really, 4 billion) unique rows). And,
if you need more than 2 billion you
can always go with a bigint (8-byte
int) and get 2^63-1 rows
Also read Disk space is cheap...That's not the point! as a follow up.
Other than this, you need to do your homework and post the required details for such a question: exact table and index definition, prevalent data access pattern (by key, by range, filters sort order, joins etc etc).
Have you done any work to identify problems so far? If not, start with Waits and Queues, a proven methodology to identify performance bottlenecks. Once you measure and find places that need improvement, we can advise how to improve.
Add an Index(es). Decide which column(s) are the most appropriate clustered index.
Decide if storing 10MB of binary data in each (otherwise small) row is a good use of a database
[Updated in response to Remus's comment]

what type of database record id to use: long or guid?

In recent years I was using MSSQL databases, and all unique records in tables has the ID column type of bigint (long). It is autoincrementing and generally - works fine.
Currently I am observing people prefer to use GUIDs for record's identity.
Does it make sense to swap bigint to guid for unique record id?
I think it doesn't make sense as generating bigint as well as sorting would be always faster than guid, but... some troubles come when using two (or more) separated instances of application and database and keep them in sync, so you have to manage id pools between sql servers (for example: sql1 uses id's from 100 to 200, sql2 uses id's from 201 to 300) - this is a thin ice.
With guid id, you don't care about id pools.
What is your advice for my mirrored application (and db): stay with traditional ID's or move to GUIDs?
Thanks in advance for your reply!
guids have the
Advantages:
Being able to create them offline from the database without worrying about collisions.
You're never going to run out of them
Disadvantages:
Sequential inserts can perform poorly (especially on clustered indexes).
Sequential Guids fix this
Take up more space per row
creating one cleanly isn't cheap
but if the clients are generating them this is actually no problem
The column should still have a unique constraint (either as the PK or as a separate constraint if it is part of some other relationship) since there is nothing stopping someone supplying the GUID by hand and accidentally/deliberately violating uniqueness.
If the space doesn't bother you and your performance if not significantly impacted they make a lot of problems go away. The decision is inevitably specific to the individual needs of the application.
I use GUIDs in any scenario that involves either replication or client-side ID generation. It's just so much easier to manage identity in either of those situations with a GUID.
For two-tier scenarios like a web application talking directly to the database, or for servers that don't need to replicate (or perhaps only need to replicate in one direction, pub/sub style) then I think an auto-incrementing ID column is just fine.
As for whether to stay with autoincs or move to GUIDs ... it's one thing to advocate GUIDs in a green-field application where you can make all these decisions up front. It's another to advise someone to migrate an existing database. It might be more pain than it's worth.
GUIDs have issues with performance and concurrency when page splits occur. INTs can run page fill at 100% - only added at one end, GUIDS add everywhere so you probably have to run a lower fill - which wastes space throughout the index.
GUIDS can be allocated in the application, so the App can know the ID of the record it will have created, which can be handy; but, technically, it is possible for duplicate GUIDs to be generated (long odds, but at least put a Unique Index on GUID columns)
I agree for merging databases its easier. But for me a straight INT is better, and then live with the hassle of sorting out how to merge DBs when/if it is actually needed.
If your data move around often, then GUID is the best one for the Key of the table.
If you really care about the performance, just stick to int or bigint
If you want to leverage both of above, use int or bigint as the key of the table and each row can have a rowguid column so that the data can also be moved around easily without losing integrity.
If the ids are going to be displayed in the querystring, use Guids, otherwise use long as a rule.

Is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?

For SQL server is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?
That depends on what you're doing:
If speed is the primary concern then a plain old int is probably big enough.
If you really will have more than 2 billion (with a B ;) ) records, then use bigint or a sequential guid.
If you need to be able to easily synchronize with records created remotely, then Guid is really great.
Update
Some additional (less-obvious) notes on Guids:
They can be hard on indexes, and that cuts to the core of database performance
You can use sequential guids to get back some of the indexing performance, but give up some of the randomness used in point two.
Guids can be hard to debug by hand (where id='xxx-xxx-xxxxx'), but you get some of that back via sequential guids as well (where id='xxx-xxx' + '123').
For the same reason, Guids can make ID-based security attacks more difficult- but not impossible. (You can't just type 'http://example.com?userid=xxxx' and expect to get a result for someone else's account).
In general I'd recommend a BIGINT over a GUID (as guids are big and slow), but the question is, do you even need that? (I.e. are you doing replication?)
If you're expecting less than 2 billion rows, the traditional INT will be fine.
Are you doing replication or do you have sales people who run disconnected databses that need to merge, use a GUID. Otherwise I'd go for an int or bigint. They are far easier to deal with in the long run.
Depends no what you need. DB Performance would gain from integer while GUIDs are useful for replication and not requiring to hear back from DB what identity has been created, i.e. code could create GUID identity before inserting into row.
If you're planning on using merge replication then a ROWGUIDCOL is beneficial to performance (see here for info). Otherwise we need more info about what your definition of 'better' is; better for what?
Unless you have a real need for a GUID, such as being able to generate keys anywhere and not just on the server, then I would stick with using INTEGER-based keys. GUIDs are expensive to create and make it harder to actually look at the data. Plus, have you ever tried to type a GUID in an SQL query? It's painful!
There can be few more aspects or requirements to use GUID.
If the primary key is of any numeric type (Int, BigInt or any other), then either you need to make it Identity column, or you need to check the last saved value in the table.
And in that case, if the record in foreign table is saved as transaction, then it would be difficult to get the last identity value of primary key. Like if IDENT_CURRENT is used, then will be again effect performance while saving record in foreign key.
So in case of saving the records as for transactions, then it would be convenient to firstly generate Guid for primary key, and then save the generated key (Guid) in primary and foreign table(s).
It really depends on whether or not the information coming in is somehow sequential. I highly recommend for things such as users that a GUID might be better. But for sequential data, such as orders or other things that need to be easily sortable that a bigint may well be a better solution as it will be indexed and provide fast sorting without the cost of another index.
It really depends whether you're expecting to have replication in the picture. Replication requires a row UUID, so if you're planning on doing that you may as well do it up front.
I'm with Andrew Rollings.
Now you could argue space efficiency. An int is what, 8 bytes max? A guid is going to much longer.
But I have two main reasons for preference: readability and access time. Numbers are easier for me than GUIDs (since I can always find the next/previous record easily).
As for access time, note that some DBs can start to have BIG problems with GUIDs. I know this is the case with MySQL (MySQL InnoDB Primary Key Choice: GUID/UUID vs Integer Insert Performance). This may not be much of a problem with SQL Server, but it's something to watch out for.
I'd say stick with INT or BIGINT. The only time I would think you'd want the GUID is when you are going to give them out and don't want people to be able to guess the IDs of other records for security reasons.

Advantages and disadvantages of GUID / UUID database keys

I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.
Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?
Advantages:
Can generate them offline.
Makes replication trivial (as opposed to int's, which makes it REALLY hard)
ORM's usually like them
Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.
Disadvantages:
Larger space use, but space is cheap(er)
Can't order by ID to get the insert order.
Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
Harder to do manual debugging, but not that hard.
Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.
I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:
unique ID for the row (GUID/whatever). Never visible to the user.
public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)
UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.
If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).
So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).
Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(
#Matt Sheppard:
Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).
So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.
Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.
Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashion regarding that general purpose programming languages and (especially) middle tier frameworks changes all the time, and will generally never out-live your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).
In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.
The main advantages are that you can create unique id's without connecting to the database. And id's are globally unique so you can easilly combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.
The main disadvantages are a bit more storage needed (not a problem on modern systems) and the id's are not really human readable. This can be a problem when debugging.
There are some performance problems like index fragmentation. But those are easilly solvable (comb guids by jimmy nillson: http://www.informit.com/articles/article.aspx?p=25862 )
Edit merged my two answers to this question
#Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said it is easilly solved by adding meaningfull unique constraints to non-key columns. The alternative is to use a natural key and those have real problems..
GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).
One other small issue to consider with using GUIDS as primary keys if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because of the nature of a guid not begin sequential in anyway, thus their will be page splits, etc when you insert. Just something to consider if the system is going to have high IO...
primary-keys-ids-versus-guids
The Cost of GUIDs as Primary Keys (SQL Server 2000)
Myths, GUID vs. Autoincrement (MySQL 5)
This is realy what you want.
UUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
There is one thing that is not really addressed, namely using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.
RDBMs usually ensure the uniqueness of the primary keys, and ensure the lookups by a key, in a structure called BTree, which is a search tree with a large branching factor (a binary search tree has branching factor of 2). Now, a sequential integer ID would cause the inserts to occur just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.
Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.
Advantages:
UUID values are unique between tables and databases. Thats why it can be merge rows between two databases or distributed databases.
UUID is more safer to pass through url than integer type data.
If one pass UUID through url, attackers can't guess the next id.But if we pass Integer type such as 10, then attackers can guess the next id is 11 then 12 etc.
UUID can generate offline.
One thing not mentioned so far: UUIDs make it much harder to profile data
For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both
provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)
provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)
From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. this matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.
For example, I am a company that offers customers a permissions gated portal. the address is portal.com/profile/{customerId}. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying for lastKnownCustomerCount + 1 regularly, and checking if the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).
UUIDs non-sequential nature mitigate these issues. This isn't a garunted to prevent profiling, but it's a start.

Tables with no Primary Key

I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
#Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is not a true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high currency contention. Basically, what happens is most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require Exlusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isnt too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using dbcc reindex. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question - then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one thats there.
Yet another huge benefit for using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You dont have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
document the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, that the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides a uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
Nobody answered actual question: what are pluses/minuses of a table with NO PK NOR a CLUSTERED index.
In my opinion, if you optimize for faster inserts (especially incremental bulk-insert, e.g. when you bulk load data into a non-empty table), such a table: with NO clustered index, NO constraints, NO Foreign Keys, NO Defaults and NO Primary Key, in a database with Simple Recovery Model, is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety) you may want to add a non-clustered non-unique indexes as needed but keep them to the minimum.
I too have always heard having an auto-incrementing int is good for performance even if you don't actually use it.
A Primary Key needn't be an autoincrementing field, in many cases this just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, your are correct identities are something to stear clear of. I would make your GUID a primary key but nonclustered since you can't use newsequentialid. That stikes me as your best course. If you don't make it a PK but put a unique index on it, sooner or later that may cause people who maintain the system to not understand the FK relationships properly introducing bugs.

Resources