DB design for synced desktop application - sql-server

I'm building a desktop application that will run on multiple laptops. It will need to sync up to a central database whenever the user is back in the office and has access again.
My biggest problem to overcome is how to design the database so that it is easily synced with the central database server. One of the major hurdles is trying to determine how to handle keys such that they don't duplicate across the multiple laptop databases that'll be in use.
For example, say Laptop 1 enters a new customer called "Customer A" - using a unique id, it might be assigned a customer ID of 20. Laptop 2 enters "Customer C" - it could also assign the ID of 20 to that customer. When it comes time to sync, both customer A & C would end up on the server with a duplicate ID.
Has anyone worked with an app similar to this that has an elegant solution?

Alternatives are:
ranges of ids, which are difficult to implement and to enforce. They are very hard to manage, is difficult to provision an new site/range and is even more difficult to cope with retired sites/ranges.
composite keys, (siteId, EntityId) are better than ranges but require upfront design and may cause problems on queries that aggregate multiple/all sites (central DWs) because the leftmost key in all indexes (siteId) is not specified.
GUIDs. In theory they are perfect as they guarantee uniqueness (they may conflict in theory, I'm yet to see a true guid conflict). In practice they are very poor clustered key choices because of the size (16 bytes) and because of the fragmentation. Still, they are much easier to get right than the composite key approach.

Use Guids (aka uniqueidentifier) for your primary keys, and all of your problems with replication will go away. The penalties for using Guids are almost always drastically overstated by proponents of auto-increment int PKs. The extra size they require (16 bytes vs. 4 bytes for an int) is so trivial that I'm amazed it ever comes up in discussions on this subject. Queries that JOIN on Guid PKs will run slower than queries that JOIN on int PKs, but it is not an order-of-magnitude (i.e. 10X) slower. Your results may vary, but I've found that Guid JOIN queries take around 40% longer to execute (which may seem like a large penalty until you remember Moore's law).
Using Guids for PKs also gets you out of doing really hacky stuff when you're inserting related data. Instead of having to insert the child records and then retrieve the IDs for the just-inserted rows before inserting the parent rows, you just create all the IDs client-side with Guid.NewGuid().
I've worked with replicated systems before that used assigned ranges with int PKs to solve this problem, but I will never ever do that again.

I don't know that the is an "elegant" solution to this very inelegant problem. Remus has good thoughts, also suggest you do some reading on replication.
You had also better design a de-dup process because sure as anything, rep A is going to upt in customer A and rep b is going to put in customer A as well and because they came from differnt sources with differnt primary keys, they will be different records.

Related

Composite primary keys in databases

I need to create a database design where bar or event owners (admin-table because they own a profile) send alerts to users.
I want to create an "ALERT" table but have a hard time deciding what primary key to use.
I thought adding a composite key containing at least admin_ID (PFK), user_ID (PFK) and I thought to add Date and Time stamp to the primary key to indicate that many alerts (notifications) can be sent by an admin, but only 1 at a certain point in time.
However, from this thread: Timestamp as part of composite primary key?, I learned that I shouldn't use the timestamp.
By the way, it has not been decided what DB software we will use at this point.
I sometimes have the tendency to quickly move to an autoincrement key. I have never noticed problems with that (in Access), however from what I read, this may not always be the most meaningful thing to do, therefore I ask the question to professionals.
I only have one chance to do this the right way to make the back end fundamentally right.
What are your thoughts on this?
Your going to get a lot of heated debate over the issue of what to use for a PK. A purist will tell you that you should never use an auto-increment key (assuming SQL SERVER). Someone who's been in the real trenches will tell you they are perfectly fine as a PK (in most cases) and have some real advantages.
I won't try to persuade you one way or the other. But I will tell you my experience. I have been developing software for 25 years and have been using an RDBMS for much of that time (mostly SQL Server). Personally, I find auto-increment PKs invaluable and 99.99% of the time would never consider using anything else. Why?
SQL Server generates them and, therefore, handles all concurrency issues.
They are small in size and, therefore, lookups on these keys are very fast.
You can easily refer to a particular row in a table by its Id value. This may not sound like a big deal, but it can sure come in handy. For example, I cannot tell you how many times I've asked a co-developer to take a look at a row in a table with key value 12345. A simple thing, but very useful.
FKs that reference a PK in a table will be the same type, and, therefore, also small and fast, and easy to work with.
Little or no fragmentation of the index, especially if the PK is a clustered index (in SQL Server).
There can be one big disadvantage with this sort of PK. If you ever have to merge rows from one database to another, you have the potential of running into PK value collisions. However, there are ways to work around this as well. Also, of course, an auto-increment value will have no true meaning. But that's ok. Its purpose is to provide uniqueness.
Is this a perfect solution. Nope. And others will disagree with the use of them. However, they have worked very well for me, for most high-end, and business critical, projects.
you have always the option to use UUID ( http://en.wikipedia.org/wiki/Universally_unique_identifier ) for your database primary keys.
All major programming languages have a support for it and IMO it is one choice, probably though not the best because it suffers from low efficiency .
There is nothing wrong with composite primary keys per se
There is a problem if any of a PK's components is NULLable
... such as when any of the components also functions as a FK to another table (or "domain")

Database Design Primay Key, ID vs String

I am currently planning to develop a music streaming application. And i am wondering what would be better as a primary key in my tables on the server. An ID int or a Unique String.
Methods 1:
Songs Table:
SongID(int), Title(string), *Artist**(string), Length(int), *Album**(string)
Genre Table
Genre(string), Name(string)
SongGenre:
***SongID****(int), ***Genre****(string)
Method 2
Songs Table:
SongID(int), Title(string), *ArtistID**(int), Length(int), *AlbumID**(int)
Genre Table
GenreID(int), Name(string)
SongGenre:
***SongID****(int), ***GenreID****(int)
Key: Bold = Primary Key, *Field** = Foreign Key
I'm currently designing using method 2 as I believe it will speed up lookup performance and use less space as an int takes a lot less space then a string.
Is there any reason this isn't a good idea? Is there anything I should be aware of?
Is there any reason this isn't a good idea? Is there anything I should be aware of?
Yes. Integer IDs are very bad if you need to uniquely identify the same data outside of a single database. For example, if you have to copy the same data into another database system with potentially pre-existing data or you have a distributed database. The biggest thing to be aware of is that an integer like 7481 has no meaning outside of that one database. If later on you need to grow that database, it may be impossible without surgically removing your data.
The other thing to keep in mind is that integer IDs aren't as flexible so they can't easily be used for exceptional cases. The designers of the Internet Protocol understood this and took precautions by allocating certain blocks of numbers as "special" in one way or another (broadcast IPs, private IPs, network IPs). But that was only possible because there's a protocol surrounding the usage of those numbers. Many databases don't operate within such a well-defined protocol.
FWIW, it's kind of like trying to decide if having a "strongly typed" programming paradigm is better than a "weakly/dynamically typed" programming paradigm. It will depend on what you need to do.
You are doing the right thing - identity field should be numeric and not string based, both for space saving and for performance reasons (matching keys on strings is slower than matching on integers).
From the software perspective the GUID is better as its unique globally.
Quotes from: Primary Keys: IDs versus GUIDs
Using a GUID as a row identity value feels more natural-- and
certainly more truly unique-- than a 32-bit integer. Database guru Joe
Celko seems to agree. GUID primary keys are a natural fit for many
development scenarios, such as replication, or when you need to
generate primary keys outside the database. But it's still a question
of balancing the tradeoffs between traditional 4-byte integer IDs and
16-byte GUIDs:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if
you're not careful
Cumbersome to debug where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}'
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of
clustered indexes
My recommendation is: use ids.
You'll be able to rename that "Genre" with 20000 songs without breaking anything.
The idea behind this is that the id identifies the row in the table. Whatever the row has is something that doesn't matters in this problem.
This is in large part a matter of personal preference.
My personal opinion and practice is to always use integer keys and to always use surrogate rather than natural keys (so never use anything like social security number or the genre name directly).
There are cases where an auto number field is not appropriate or does not scale. In these cases it can make sense to use a GUID, which can be a string in databases that do not have a native datatype for it.

what type of database record id to use: long or guid?

In recent years I was using MSSQL databases, and all unique records in tables has the ID column type of bigint (long). It is autoincrementing and generally - works fine.
Currently I am observing people prefer to use GUIDs for record's identity.
Does it make sense to swap bigint to guid for unique record id?
I think it doesn't make sense as generating bigint as well as sorting would be always faster than guid, but... some troubles come when using two (or more) separated instances of application and database and keep them in sync, so you have to manage id pools between sql servers (for example: sql1 uses id's from 100 to 200, sql2 uses id's from 201 to 300) - this is a thin ice.
With guid id, you don't care about id pools.
What is your advice for my mirrored application (and db): stay with traditional ID's or move to GUIDs?
Thanks in advance for your reply!
guids have the
Advantages:
Being able to create them offline from the database without worrying about collisions.
You're never going to run out of them
Disadvantages:
Sequential inserts can perform poorly (especially on clustered indexes).
Sequential Guids fix this
Take up more space per row
creating one cleanly isn't cheap
but if the clients are generating them this is actually no problem
The column should still have a unique constraint (either as the PK or as a separate constraint if it is part of some other relationship) since there is nothing stopping someone supplying the GUID by hand and accidentally/deliberately violating uniqueness.
If the space doesn't bother you and your performance if not significantly impacted they make a lot of problems go away. The decision is inevitably specific to the individual needs of the application.
I use GUIDs in any scenario that involves either replication or client-side ID generation. It's just so much easier to manage identity in either of those situations with a GUID.
For two-tier scenarios like a web application talking directly to the database, or for servers that don't need to replicate (or perhaps only need to replicate in one direction, pub/sub style) then I think an auto-incrementing ID column is just fine.
As for whether to stay with autoincs or move to GUIDs ... it's one thing to advocate GUIDs in a green-field application where you can make all these decisions up front. It's another to advise someone to migrate an existing database. It might be more pain than it's worth.
GUIDs have issues with performance and concurrency when page splits occur. INTs can run page fill at 100% - only added at one end, GUIDS add everywhere so you probably have to run a lower fill - which wastes space throughout the index.
GUIDS can be allocated in the application, so the App can know the ID of the record it will have created, which can be handy; but, technically, it is possible for duplicate GUIDs to be generated (long odds, but at least put a Unique Index on GUID columns)
I agree for merging databases its easier. But for me a straight INT is better, and then live with the hassle of sorting out how to merge DBs when/if it is actually needed.
If your data move around often, then GUID is the best one for the Key of the table.
If you really care about the performance, just stick to int or bigint
If you want to leverage both of above, use int or bigint as the key of the table and each row can have a rowguid column so that the data can also be moved around easily without losing integrity.
If the ids are going to be displayed in the querystring, use Guids, otherwise use long as a rule.

Reasons not to use an auto-incrementing number for a primary key

I'm currently working on someone else's database where the primary keys are generated via a lookup table which contains a list of table names and the last primary key used. A stored procedure increments this value and checks it is unique before returning it to the calling 'insert' SP.
What are the benefits for using a method like this (or just generating a GUID) instead of just using the Identity/Auto-number?
I'm not talking about primary keys that actually 'mean' something like ISBNs or product codes, just the unique identifiers.
Thanks.
An auto generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.
If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
There's nothing inherently wrong with using AutoNumber, but there are a few reasons not to do it. Still, rolling your own solution isn't the best idea, as dacracot mentioned. Let me explain.
The first reason not to use AutoNumber on each table is you may end up merging records from multiple tables. Say you have a Sales Order table and some other kind of order table, and you decide to pull out some common data and use multiple table inheritance. It's nice to have primary keys that are globally unique. This is similar to what bobwienholt said about merging databases, but it can happen within a database.
Second, other databases don't use this paradigm, and other paradigms such as Oracle's sequences are way better. Fortunately, it's possible to mimic Oracle sequences using SQL Server. One way to do this is to create a single AutoNumber table for your entire database, called MainSequence, or whatever. No other table in the database will use autonumber, but anyone that needs a primary key generated automatically will use MainSequence to get it. This way, you get all of the built in performance, locking, thread-safety, etc. that dacracot was talking about without having to build it yourself.
Another option is using GUIDs for primary keys, but I don't recommend that because even if you are sure a human (even a developer) is never going to read them, someone probably will, and it's hard. And more importantly, things implicitly cast to ints very easily in T-SQL but can have a lot of trouble implicitly casting to a GUID. Basically, they are inconvenient.
In building a new system, I'd recommend using a dedicated table for primary key generation (just like Oracle sequences). For an existing database, I wouldn't go out of my way to change it.
from CodingHorror:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
The article provides a lot of good external links on making the decision on GUID vs. Auto Increment. If I can, I go with GUID.
It's useful for clients to be able to pre-allocate a whole bunch of IDs to do a bulk insert without having to then update their local objects with the inserted IDs. Then there's the whole replication issue, as mentioned by Galwegian.
The procedure method of incrementing must be thread safe. If not, you may not get unique numbers. Also, it must be fast, otherwise it will be an application bottleneck. The built in functions have already taken these two factors into account.
My main issue with auto-incrementing keys is that they lack any meaning
That's a requirement of a primary key, in my mind -- to have no other reason to exist other than identifying a record. If it has no real-world meaning, then it has no real-world reason to change. You don't want primary keys to change, generally speaking, because you have to search-replace your whole database or worse. I have been surprised at the sorts of things I have assumed would be unique and unchanging that have not turned out to be years later.
Here's the thing with auto incrementing integers as keys:
You HAVE to have posted the record before you get access to it. That means that until you have posted the record, you cannot, for example, prepare related records that will be stored in another table, or any one of a lot of other possible reasons why it might be useful to have access to the new record's unique id, before posting it.
The above is my deciding factor, whether to go with one method, or the other.
Using a unique identifiers would allow you to merge data from two different databases.
Maybe you have an application that collects data in multiple database and then "syncs" with a master database at various times in the day. You wouldn't have to worry about primary key collisions in this scenario.
Or, possibly, you might want to know what a record's ID will be before you actually create it.
One benefit is that it can allow the database/SQL to be more cross-platform. The SQL can be exactly the same on SQL Server, Oracle, etc...
The only reason I can think of is that the code was written before sequences were invented and the code forgot to catch up ;)
I would prefer to use a GUID for most of the scenarios in which the post's current method makes any sense to me (replication being a possible one). If replication was the issue, such a stored procedure would have to be aware of the other server which would have to be linked to ensure key uniqueness, which would make it very brittle and probably a poor way of doing this.
One situation where I use integer primary keys that are NOT auto-incrementing identities is the case of rarely-changed lookup tables that enforce foreign key constraints, that will have a corresponding enum in the data-consuming application. In that scenario, I want to ensure the enum mapping will be correct between development and deployment, especially if there will be multiple prod servers.
Another potential reason is that you deliberately want random keys. This can be desirable if, say, you don't want nosey browsers leafing through every item you have in the database, but it's not critical enough to warrant actual authentication security measures.
My main issue with auto-incrementing keys is that they lack any meaning.
For tables where certain fields provide uniqueness (whether alone or in combination with another), I'd opt for using that instead.
A useful side benefit of using a GUID primary key instead of an auto-incrementing one is that you can assign the PK value for a new row on the client side (in fact you have to do this in a replication scenario), sparing you the hassle of retrieving the PK of the row you just added on the server.
One of the downsides of a GUID PK is that joins on a GUID field are slower (unless this has changed recently). Another upside of using GUIDs is that it's fun to try and explain to a non-technical manager why a GUID collision is rather unlikely.
Galwegian's answer is not necessarily true.
With MySQL you can set a key offset for each database instance. If you combine this with a large enough increment it will for fine. I'm sure other vendors would have some sort of similar settings.
Lets say we have 2 databases we want to replicate. We can set it up in the following way.
increment = 2
db1 - offset = 1
db2 - offset = 2
This means that
db1 will have keys 1, 3, 5, 7....
db2 will have keys 2, 4, 6, 8....
Therefore we will not have key clashes on inserts.
The only real reason to do this is to be database agnostic (if different db versions use different auto-numbering techniques).
The other issue mentioned here is the ability to create records in multiple places (like in the central office as well as on traveling users' laptops). In that case, though, you would probably need something like a "sitecode" that was unique to each install that was prefixed to each ID.

Advantages and disadvantages of GUID / UUID database keys

I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.
Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?
Advantages:
Can generate them offline.
Makes replication trivial (as opposed to int's, which makes it REALLY hard)
ORM's usually like them
Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.
Disadvantages:
Larger space use, but space is cheap(er)
Can't order by ID to get the insert order.
Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
Harder to do manual debugging, but not that hard.
Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.
I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:
unique ID for the row (GUID/whatever). Never visible to the user.
public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)
UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.
If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).
So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).
Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(
#Matt Sheppard:
Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).
So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.
Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.
Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashion regarding that general purpose programming languages and (especially) middle tier frameworks changes all the time, and will generally never out-live your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).
In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.
The main advantages are that you can create unique id's without connecting to the database. And id's are globally unique so you can easilly combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.
The main disadvantages are a bit more storage needed (not a problem on modern systems) and the id's are not really human readable. This can be a problem when debugging.
There are some performance problems like index fragmentation. But those are easilly solvable (comb guids by jimmy nillson: http://www.informit.com/articles/article.aspx?p=25862 )
Edit merged my two answers to this question
#Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said it is easilly solved by adding meaningfull unique constraints to non-key columns. The alternative is to use a natural key and those have real problems..
GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).
One other small issue to consider with using GUIDS as primary keys if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because of the nature of a guid not begin sequential in anyway, thus their will be page splits, etc when you insert. Just something to consider if the system is going to have high IO...
primary-keys-ids-versus-guids
The Cost of GUIDs as Primary Keys (SQL Server 2000)
Myths, GUID vs. Autoincrement (MySQL 5)
This is realy what you want.
UUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
There is one thing that is not really addressed, namely using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.
RDBMs usually ensure the uniqueness of the primary keys, and ensure the lookups by a key, in a structure called BTree, which is a search tree with a large branching factor (a binary search tree has branching factor of 2). Now, a sequential integer ID would cause the inserts to occur just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.
Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.
Advantages:
UUID values are unique between tables and databases. Thats why it can be merge rows between two databases or distributed databases.
UUID is more safer to pass through url than integer type data.
If one pass UUID through url, attackers can't guess the next id.But if we pass Integer type such as 10, then attackers can guess the next id is 11 then 12 etc.
UUID can generate offline.
One thing not mentioned so far: UUIDs make it much harder to profile data
For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both
provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)
provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)
From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. this matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.
For example, I am a company that offers customers a permissions gated portal. the address is portal.com/profile/{customerId}. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying for lastKnownCustomerCount + 1 regularly, and checking if the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).
UUIDs non-sequential nature mitigate these issues. This isn't a garunted to prevent profiling, but it's a start.

Resources