Managing identity/primary key across servers - sql-server

I'm in the middle of designing a new database that will need to support replication, and I'm stuck on deciding what to use as my primary key.
In our current database the primary key is made up of two int columns: the first is an identity column and the second identifies the server on which the row was inserted. For the new database I want to avoid a two-column primary key and use a single column instead. So far I have two ways of doing this:
Use a GUID for my primary key
This guarantees a unique key across any number of servers. What I don't like about it is that a GUID is 16 bytes in size, so when it's used as a foreign key across many tables it wastes space. It's also more awkward to work with when writing queries, and slower to query.
Use int or bigint, and manually specify the seed and increment value for every table on each server. For example, with two servers, table X on the first server would start at 1 and on the second server at 2, each incrementing by 2, so the first server would generate (1, 3, 5, ...) and the second (2, 4, 6, ...). The good thing about this design is that it's easier to use when writing queries, it's fast, and it uses less space for foreign keys. The bad thing is that we never know how many servers will be running, so it's hard to pick the increment value, and it's harder to manage schema changes across the servers.
What is the best practice for managing primary keys across multiple servers, and what's the best approach, if any, in this kind of situation?

Your question is a good one, and one that is asked often.
From a maintenance perspective, I would absolutely go with GUIDs. They are there for a reason.
Somewhere along the line you might run into complex operations moving and re-replicating your data, and then the other options can make it a little more complex than it needs to be.
There is a very nice short read about the various options here:
http://msdn.microsoft.com/en-us/library/bb726011.aspx
As for the Replication part - if done properly, there are no real headaches with replication.

Update:
Found a simpler, more manual method here. It involves using NOT FOR REPLICATION and staggering the identity seed values, as you mentioned in the comments.
Original:
Your best bet is something like the second option listed. Assign identity ranges for the replication publisher and subscriber instances, then turn on automatic range management.
This article discusses options for managing identity columns in replication, and enabling identity range management is discussed here.
Since you don't know how many servers will be in the replication pool, you may need to reconfigure article properties periodically.
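As a rough sketch of the manual staggered-seed approach (SQL Server syntax; the table is hypothetical, and the increment of 10 assumes you will never run more than 10 servers):

-- Server 1: seed 1, increment 10
CREATE TABLE dbo.SalesOrder (
    OrderID   int IDENTITY(1, 10) NOT FOR REPLICATION NOT NULL PRIMARY KEY,
    OrderDate datetime NOT NULL DEFAULT GETDATE()
);

-- Server 2: identical table, but seeded at 2 (produces 2, 12, 22, ...)
-- OrderID int IDENTITY(2, 10) NOT FOR REPLICATION NOT NULL PRIMARY KEY

On each additional server you only change the seed; the increment stays the same for all of them, which is why you have to guess the maximum number of servers up front.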

I dare to advise against replication altogether :) it's certainly more pain than fun. If you can afford it, look into the Sync Framework.
Playing with identity seeds is inflexible, to say the least. Consider what happens when you add or move servers, need identity inserts, have differing schemas, and so on.
A GUID would be all right for a clustered key if you used newsequentialid() as the default value. It is a bit larger (16 bytes instead of 4), but it solves the hassle once and for good :)
The way I'd go is an int identity clustered key that is only meaningful within the local database. Then add a GUID column, which is meaningful in the synchronization context. Top it off with a rowversion column to see what's ready for sync.
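A rough sketch of that table layout (SQL Server syntax; the table and column names are made up): a local int identity as the clustered primary key, a sequential GUID that identifies the row across servers, and a rowversion column for change detection.

CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY(1,1) NOT NULL,            -- local surrogate key, only meaningful in this database
    CustomerGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_Guid DEFAULT NEWSEQUENTIALID(),  -- identity that travels with the row between servers
    Name         nvarchar(100) NOT NULL,
    RowVer       rowversion,                             -- bumped on every insert/update; used to find rows to sync
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID),
    CONSTRAINT UQ_Customer_Guid UNIQUE (CustomerGuid)
);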

Related

Primary Key type in Cloud Share Database web app

I'm building a cloud web app with ASP.NET MVC and MS SQL Server. Clients will share the same database.
I still haven't decided what type my primary key will be. Should I use GUID or bigint?
With bigint, I'm worried about scalability; with GUID, I'm afraid of performance.
What is the best practice here? What do big cloud websites like Stack Overflow use as primary keys?
Please shed some light.
Thanks,
Reynaldi
I would use GUIDs. You don't want sequential IDs when dealing with backups and restores.
I would use INT, as its performance is better than any other datatype you could use for your primary key. You can get more than 2 billion records using INT, and if you ever reach the point where you need more than 2 billion primary keys in one table, you should consider table partitioning. Using GUID or BIGINT will carry a performance penalty even if you have only a few thousand records in your table. INT is a 4-byte datatype and GUID is a 16-byte datatype, so you could say four times the memory of an INT is required to store a GUID column, plus the performance penalty. Here is a very detailed answer by marc_s: Detailed Comparison of GUID Vs Int
Both have their good and bad points. The real question is not what datatype to use, but rather "in what way is the data going to be used?"
For example, if you are going to have millions of reads, you require the data in inserted order, and it makes sense to 'sort' the data in the order it was inserted, then an IDENTITY INT column will probably do. There's no real need for BIGINT; you can even seed your INT column at -2 billion and have potentially 4 billion values available, and I doubt you will get there. You will not want to use int or bigint, however, if you ever want to 'shard' or scale your table out. Especially with SQL Azure, which makes scaling out simple, you would want to steer more in the direction of a GUID.
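The negative-seed trick mentioned above would look something like this (hypothetical table, SQL Server syntax):

-- Start the identity at the bottom of the int range to get roughly 4 billion usable values
CREATE TABLE dbo.Hits (
    HitID   int IDENTITY(-2147483648, 1) NOT NULL PRIMARY KEY,
    HitTime datetime NOT NULL DEFAULT GETDATE()
);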
Sure, with a GUID the table is not 'sorted' in inserted order, but it definitely makes sense to use one if you plan to scale out, or even to join against tables in other databases.
If you are going to do massive batch inserts, say 10,000-plus rows per batch, then something like an INT is better suited, since it avoids page splits, or at least keeps them to a minimum. However, if the batch inserts happen during quiet hours, this is not an issue.
Again, SQL Server indexes are brilliant, and if you design a proper index for your table, there is no reason not to use either of them.
Indexes are a whole other kettle of fish, and since they weren't part of the question I won't attempt to cover them here, but if I were you I would spend a bit of time understanding the impact of too many, too few, or the wrong indexes on a table.
Really the answer to your question is not as simple as GUID or INT/BIGINT, but rather a holistic view and understanding of your application, its usage, and how and when your table will be used. Only then can you make a decision that will best suit your table.
I hope this helps.

Questions about database indexes

When a database index is created for a unique constraint on a field or multiple indexes are created for a multiple field unique constraint, are those same indexes able to be used for efficiency gains when querying for objects, much in the same way any other database index is used? My guess is that the indexes created for unique constraints are the same as ones created for efficiency gains, and the unique constraint itself is something additional, but I'm not well versed in databases.
Is it ever possible to break a unique constraint, including a multiple-field constraint (e.g. field_a and field_b are unique together), through long transactions, high concurrency, etc.? Or does a unique constraint offer 100% protection?
As to question 1:
YES - these are indexes like any other indexes you define and are used in query plans, for example for performance gains... you can define unique indexes without defining a "unique constraint", by the way.
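To illustrate that last point, here are both forms side by side (SQL Server syntax; the table and columns are invented). Either one enforces uniqueness and gives the optimizer an index to use:

CREATE TABLE dbo.Users (
    UserID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Email    nvarchar(256) NOT NULL,
    Username nvarchar(50)  NOT NULL
);

-- A unique constraint, backed by a unique index under the covers
ALTER TABLE dbo.Users ADD CONSTRAINT UQ_Users_Email UNIQUE (Email);

-- A standalone unique index: same enforcement, usable by the optimizer just the same
CREATE UNIQUE INDEX IX_Users_Username ON dbo.Users (Username);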
As to question 2:
YES - it is 100% protection, as long as the DB engine is ACID compliant and reliable (i.e. no bugs in this regard) and as long as you don't temporarily disable the constraint.
Yes. A unique constraint is an index (in SQL Server) and will (can) be used in query plans.
This is impossible. Regardless of transaction times or concurrency issues, you cannot store data in a table that violates a constraint (at least in SQL Server). BTW, if your transactions are so long that you're worried about this, you NEED to rethink what you're doing in the context of this transaction. Even though you won't violate database constraints with long transaction operations, YOU WILL run into other problems.
The problem with your question is that it is very general and not tailored to a specific implementation, so any answer will be quite generic too.
With that in mind:
Whenever a database thinks that access via an index might speed things up, it will use it - uniqueness is not a concern here. If many indexes exist on one table, a decent database will try to use the "best" one - with differing views of what "best" actually means. BUT many databases will only use one index to fetch a row. Therefore, as a rule of thumb, DBs usually try to use indexes whose lookups result in as few rows as possible. A unique index is quite good at this. :-)
Actually this is not one point but two different points:
A decent DB will not corrupt your index even under long-running transactions or high concurrency - at least not on purpose. If it does, it is either a bug in the DB software, which has to be fixed very quickly (otherwise the vendor suffers a drastic reputation loss), or it is not a decent DB at all but a mere persistent hashmap or something like that. If the data really matters, high concurrency and long-running transactions are no excuse.
Multi-column unique indexes are a beast: DB implementations differ slightly in what they consider "unique" when one or more of the key columns contain NULL. For example, see the PostgreSQL documentation on this point: http://www.postgresql.org/docs/9.1/interactive/indexes-unique.html
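A quick illustration of the NULL difference (column names are made up): PostgreSQL, per the documentation linked above, treats NULLs as distinct, so the second insert below succeeds there, while SQL Server treats NULL as a value and rejects it as a duplicate.

CREATE TABLE t (
    field_a int,
    field_b int,
    CONSTRAINT uq_t UNIQUE (field_a, field_b)
);

INSERT INTO t (field_a, field_b) VALUES (1, NULL);
INSERT INTO t (field_a, field_b) VALUES (1, NULL);  -- allowed in PostgreSQL, duplicate-key error in SQL Server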
Hope this makes some things clear.

Access database performance

For a few different reasons one of my projects is hosted on a shared hosting server
and developed in ASP.NET/C# with Access databases (not my choice, so don't laugh at this limitation; it's not from me).
Most of my queries are on the last few records of the tables they are querying.
My question is in 2 parts:
1 - Is the order of the records in the database only visual, or is there an actual difference internally? More specifically: as currently designed, all records (for all databases in this project) are ordered ascending by a row-identifying key (an AutoNumber field). Since over 80% of my queries will hit records towards the end of the table, would query performance improve if I set the table to show the most recent records at the top instead of at the end?
2 - Is there any other performance tuning that can be done to help with Access tables?
"Access" and "performance" in the same sentence is a bit of a stretch, but the database type wasn't a choice, and so far it hasn't proven to be a big problem. Still, if I can help the performance I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes, and a primary key (automatically indexed) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing: I'm already doing all that can be done for Access performance. I'll give the accepted answer to whoever answered fastest.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes, there is an actual order to the records in the database. Setting the display defaults in the table preferences isn't going to change that.
I would ensure there are indexes on all your WHERE-clause columns. This is a rule of thumb; it will rarely be the optimal setup, and you would have to do workload testing against different database configurations to find the truly optimal solution.
I work daily with a legacy Access system that can be reasonably fast with concurrent users, but only for a smallish number of them.
You can use indexes on the fields you search on (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
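If you prefer DDL to the table designer, Access/Jet SQL can also create indexes directly; the table and field names below are just placeholders. The first statement adds a plain index to speed up lookups on a non-key field, the second adds a unique index where values must not repeat:

CREATE INDEX idxOrdersCustomerID ON Orders (CustomerID);
CREATE UNIQUE INDEX idxCustomersEmail ON Customers (Email);

Note that an Access query runs only one DDL statement at a time, so execute them separately.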
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it works the same as SQL Server, you won't need to include the primary key in the index (assuming the table is clustered on it), as it's included by default.
If you're running queries on a small subset of fields, you could make your index a 'covering' index by including all the fields required. There's a space trade-off here, so I really only recommend it for 5 fields or fewer, depending on your requirements.
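For comparison, this is what a covering index looks like in SQL Server terms (Access has no INCLUDE clause, so treat this purely as an illustration of the idea; all names are invented):

CREATE TABLE dbo.Orders (
    OrderID    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID int NOT NULL,
    OrderDate  datetime NOT NULL,
    OrderTotal money NOT NULL,
    Status     tinyint NOT NULL
);

-- Key columns drive the seek; INCLUDEd columns are stored at the leaf level,
-- so a query selecting only these columns never has to touch the base table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID, OrderDate)
    INCLUDE (OrderTotal, Status);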
Are you actually experiencing a performance problem now or is this just a general optimization question? Also from your post it sounds like you are talking about a db with 1 table, is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if the bulk of your table rows are rarely accessed (as implied in your question), create an archive table for older records. Or, if most of your performance hits come from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
In regard to table order, Jet/ACE writes the actual table data in PK order. If you want a different order, change the PK.
But this shouldn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the AutoNumber to a natural key (though that can be problematic, given real-world data and the restrictions on Null values in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field length: create your fields as large as you'll need, but don't go over - for instance, if you have a number field and the value will never be over 1000 (for the sake of argument), don't type it as a Long Integer; something smaller like Integer would be more appropriate, or use a Single instead of a Double for decimal numbers, etc. By the same token, if you have a text field that won't hold more than 50 characters, don't set it up for 255. Sounds obvious, but it happens, often with the idea that "I might need that space in the future", and your app suffers in the meantime.
Not to beat the indexing thing to death... but tables that you join together in your queries should have relationships established; this creates indexes on the foreign keys, which greatly improves the performance of table joins. (NOTE: double-check any foreign keys to make sure they did indeed get indexed - I've seen cases where they weren't, so a relationship doesn't necessarily mean the proper indexes have been created.)
Apparently, compacting your DB regularly can help performance as well; it reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under Tools > Analyze > Performance; it might be worth running it on your tables and queries at least, to see what it comes up with. The Table Analyzer (available from the same menu) can help you split out tables with a lot of redundant data - obviously use it with caution, but it could be helpful.
This link has a bunch of material on Access performance optimization covering pretty much all aspects of the database - tables, queries, forms, etc. It's worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here, it's useful to consider how Access works: in an un-indexed table there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed, by virtue of the fact that Access / the JET engine is an ISAM database (http://en.wikipedia.org/wiki/ISAM), it's rather the other way around. That's somewhat moot, however, as I would never suggest reordering a table to put frequently accessed records at the top; as others have said, it's best to rely on useful indexes.

GUID vs INT IDENTITY [duplicate]

Possible Duplicate:
How do you like your primary keys?
I'm aware of the benefits of using a GUID, as well as the benefits of using an INT, as a PK in a database. Considering that a GUID is in essence a 128-bit integer and a normal INT is 32-bit, the INT is a space saver (though this point is generally moot in most modern systems).
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
Kimberly Tripp (SQLskills.com) has an article on using GUIDs as primary keys. She advises against it because of the unnecessary overhead.
To answer your question:
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
I would use a GUID if my system had an online/offline version, where in the offline version you can save data that is later transferred back to the server during a sync. That way, you are sure you won't end up with the same key twice in your database.
We have GUIDs everywhere in our very complex enterprise software. It works smoothly.
I believe GUIDs are semantically more suitable to serve as identifiers. There is also no point in worrying unnecessarily about performance until you are actually faced with that problem. Beware premature optimization.
There is also an advantage for database migration of any sort: with GUIDs you will have no collisions. If you attempt to merge several DBs where ints are used for identity, you will have to replace their values, and if those old values were used in URLs, they will now change, with the resulting SEO hit.
Apart from being a poor choice when you need to synchronize several database instances, INTs have one drawback I haven't seen mentioned: inserts always occur at one end of the index tree. This increases lock contention when you have a table with a lot of movement (since the same index pages have to be modified by concurrent inserts, whereas GUIDs are inserted all over the index). The index may also have to be rebalanced more often if a B* tree or similar data structure is used.
Of course, ints are easier on the eye when doing manual queries and report construction, and space consumption may add up through FK usage.
I'd be interested to see any measurements of how well, e.g., SQL Server actually handles insert-heavy tables with IDENTITY PKs.
the INT is a space saver (though this point is generally moot in most modern systems).
Not so. It may seem so at first glance, but note that the primary key of each table will be repeated multiple times throughout the database - in indexes and as a foreign key in other tables. And it will be involved in nearly every query touching its table, very intensively so when it's a foreign key used for a join.
Furthermore, remember that modern CPUs are very, very fast, but RAM speeds have not kept up, so cache behaviour becomes increasingly important. And the best way to get good cache behaviour is to have smaller data sets. So the seemingly irrelevant difference between 4 and 16 bytes may well result in a noticeable difference in speed. Not necessarily always - but it's something to consider.
When comparing values, such as in a primary-to-foreign-key relationship, the INT will be faster. If the tables are indexed properly and are small, you might not see much of a slowdown, but you'd have to try it to be sure. INTs are also easier to read and to communicate to other people. It's a lot simpler to say "Can you look at record 1234?" than "Can you look at record 031E9502-E283-4F87-9049-CE0E5C76B658?"
If you are planning on merging databases at some stage, i.e. for a multi-site replication-type setup, GUIDs will save a lot of pain. Other than that, I find ints easier.
If the data lives in a single database (as most data for the applications that we write in general does), then I use an IDENTITY. It's easy, intended to be used that way, doesn't fragment the clustered index and is more than enough. You'll run out of room at 2 billion some records (~ 4 billion if you use negative values), but you'd be toast anyway if you had that many records in one table, and then you have a data warehousing problem.
If the data lives in multiple, independent databases or interfaces with a third-party service, then I'll use the GUID that was likely already generated. A good example would be a UserProfiles table in the database that maps users in Active Directory to their user profiles in the application via their objectGUID that Active Directory assigned to them.
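A minimal sketch of that second case (all names here are hypothetical): the externally generated GUID is simply stored as the key rather than generated by the database.

CREATE TABLE dbo.UserProfiles (
    ObjectGuid  uniqueidentifier NOT NULL PRIMARY KEY,  -- objectGUID assigned by Active Directory, not generated here
    DisplayName nvarchar(256) NOT NULL,
    CreatedAt   datetime NOT NULL DEFAULT GETDATE()
);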
Some OSes no longer generate GUIDs based on unique hardware features (CPUID, MAC address), because that made tracing users too easy (privacy concerns). This means GUID uniqueness is often no longer as universal as many people think.
If you use some auto-id function of your database, the database could in theory make absolutely sure that there is no duplication.
I always think PKs should be numeric where possible. Don't forget that having a GUID as a PK will probably mean it is also used in other tables as a foreign key, so paging and index size, etc., will be greater.
An INT is certainly much easier to read when debugging, and much smaller.
I would, however, use a GUID or similar as a license key for a product. You know it's going to be unique, and you know that it's not going to be sequential.
I think the database also matters. From a MySQL perspective - generally, the smaller the datatype the faster the performance.
It seems to hold true for int vs GUID too -
http://kccoder.com/mysql/uuid-vs-int-insert-performance/
I would use a GUID as a PK only if the key is bound to a value that is itself a GUID - for example, a user ID (users in Windows NT are described by GUIDs) or a user group ID.
Another example: if you develop a distributed document-management system where different parts of the system, in different places all over the world, can create documents, I would use a GUID, because it guarantees that two documents created in different parts of the distributed system won't end up with the same ID.

Preventing Duplicate Keys Between Multiple Databases

I'm in a situation where we have a new release of software coming out that's going to use a separate database (significant changes in the schema) from the old. There's going to be a considerable period of time where both the new and the old system will be in production and we have a need to ensure that there are unique IDs being generated between the two databases (we don't want a row in database A to have the same ID as a row in database B). The databases are Sybase.
Possible solutions I've come up with:
Use a data type that supports really large numbers and allocate a range for each, hoping they never overflow.
Use negative values for one database and positive values the other.
Adding an additional column that identifies the database and use the combination of that and the current ID to serve as the key.
Cry.
What else could I do? Are there more elegant solutions, someway for the two databases to work together? I believe the two databases will be on the same server, if that matters.
GUIDs or Globally Unique IDs can be handy primary keys in this sort of case. I believe Sybase supports them.
Edit: Although if your entire database is already based around integer primary keys, switching to GUIDs might not be very practical - in that case just partition the integer space between the databases, as GWLlosa and others have suggested.
I've seen this happen a few times. On the ones I was involved with, we simply allocated a sufficiently large ID space for the old one (since it had been in operation for a while, and we knew how many keys it'd used, we could calculate how many more keys it'd need for a specified 'lifespan') and started the "ID SEQUENCE" of the new database above that.
I would recommend against any of the other tricks, if only because they all require changes to the 'legacy' app, which is a risk I don't see the need to take.
In your case I would consider using uniqueidentifier (GUID) as the datatype. Values are unique when you generate them using the system function newid().
CREATE TABLE customer (
    cust_key UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    rep_key  VARCHAR(5),
    PRIMARY KEY (cust_key)
)
Pre-seed your new database with a starting number greater than anything your old database will reach before you merge them.
I have done this in the past; the volume of records was reasonably low, so I started my new database at 200,000 as the initial record number. Then, when the time came, I just migrated all of the old records into the new system.
Worked out perfectly!
GUIDs are designed for this situation.
If you will be creating a similar number of new items on each database, you can try even ids on one database, odd ids in the other.
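In SQL Server-style syntax the odd/even split would look like the sketch below; Sybase's identity options differ, so treat this purely as an illustration (the table and columns are made up):

-- Database A: odd IDs
CREATE TABLE Widget (
    WidgetID int IDENTITY(1, 2) NOT NULL PRIMARY KEY,
    Name     varchar(50) NOT NULL
);

-- Database B: identical table, but seeded at 2 so it produces even IDs
-- WidgetID int IDENTITY(2, 2) NOT NULL PRIMARY KEY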
Use the UNIQUEIDENTIFIER data type. I know it works in MS SQL Server and as far as I can see from Googling, Sybase supports it.
http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbrfen9/00000099.htm
