data distribution based on primary key - database

Currently in one of my projects, we're supporting 32k entities, however it's reaching its limits for performance, and hence we're thinking of distributing it to different databases based on their integer primary keys. E.g. the first 35k will go to one db, the next 35k to the next db and so on (based on (primary key % #db) logic).
However, this will present a problem when we're inserting an entity into db. Since we don't know its primary key value beforehand, how do we figure out which db to insert it into?
One possibility is maintaining a global id table in only one db. So we insert into it first, get the primary key value and then use it to choose a db for further detailed insertion. But this solution is not uniform and hence difficult to maintain and extend. So any other thoughts on how to go about it?

Found this nice article that talks about how Flickr solved this problem:
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
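A minimal sketch of the "global id table" idea from the question, loosely in the spirit of the ticket-server article linked above. It assumes in-memory SQLite in place of the real databases; the `tickets` and `entities` tables, the helper names, and the shard count of 4 are all hypothetical.

```python
import sqlite3

NUM_SHARDS = 4  # assumed number of databases


def make_ticket_db() -> sqlite3.Connection:
    """Create the single database whose only job is handing out ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY AUTOINCREMENT)")
    return conn


def next_id(ticket_db: sqlite3.Connection) -> int:
    """Reserve one globally unique id by inserting an empty row."""
    cur = ticket_db.execute("INSERT INTO tickets DEFAULT VALUES")
    ticket_db.commit()
    return cur.lastrowid


def shard_for(entity_id: int, num_shards: int = NUM_SHARDS) -> int:
    """The (primary key % #db) rule from the question."""
    return entity_id % num_shards


ticket_db = make_ticket_db()
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for shard in shards:
    shard.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT)")


def insert_entity(name: str) -> int:
    """Get the id first, then route the detailed insert to the chosen shard."""
    entity_id = next_id(ticket_db)
    shard = shards[shard_for(entity_id)]
    shard.execute("INSERT INTO entities (id, name) VALUES (?, ?)", (entity_id, name))
    shard.commit()
    return entity_id


ids = [insert_entity(f"entity-{n}") for n in range(8)]
```

The id is known before the detailed insert, so the routing rule can be applied the same way for reads and writes.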

Related

data integration practice key field handling

Seeking thoughts and advice in addressing a customer request. We're working in an existing database that contains primary keys but no foreign keys, which won't change.
The request is for us to integrate data from outside sources which may contain duplicate values in the data.
The data is from a records-management system, so key id values will increase with each record collected.
The key fields, for the most part, are decimal(22,0).
This is a daily ETL load of a relatively small amount of data, so performance isn't the most important concern.
Is it better to prefix, suffix, or use some other strategy to create unique key values that can be traced back to the source?
For instance, if the max existing value were 123456789, is it a good idea to prefix 100000 for external site A, 110000 for external site B, etc? We've batted a few ideas about as a team and there seem to be pros and cons with everything we can think of. Not seeing much guidance on the web.
Thanks in advance for any ideas!
Generally it's better to have a separate field tracking whether the data is from A or B which is either part of your primary key or a supplementary index/constraint where you use a surrogate identity/auto-increment field as the primary key.
Among other things, this makes it easier to later do things like table partitioning or conditional indexes.
IMHO that's a design anti-pattern. You're ascribing special meaning to internal database keys. Bad idea. Also if you ever did hit the 10,000 limit you're completely stuffed. I know you think you won't... but you might and it will be a catastrophe.
The normal pattern (again in my experience and opinion) is to add a SRC_System and a SRC_Key field with a unique index/constraint on them. These are used to track where the record came from.
In addition to these two new fields, you also have your original primary key in this table which is an internal key used by the system.
This design allows any number of additional systems with any range of keys to be added to your system.
The only challenges then are working out:
How to merge - i.e. if two systems have records representing the same thing
How to add additional reference data if required
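The SRC_System / SRC_Key pattern described above could look roughly like this. It is a sketch under assumptions: SQLite stands in for the real database, and the `records` table, the `payload` column, and the `load` helper are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        id         INTEGER PRIMARY KEY,  -- internal key, no external meaning
        src_system TEXT NOT NULL,        -- which source the row came from
        src_key    TEXT NOT NULL,        -- the key as it appears in that source
        payload    TEXT,
        UNIQUE (src_system, src_key)     -- one row per source record
    )
    """
)


def load(src_system: str, src_key: str, payload: str) -> bool:
    """Insert a source record; return False if it was already loaded."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO records (src_system, src_key, payload) VALUES (?, ?, ?)",
                (src_system, src_key, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = load("A", "123456789", "from site A")
same_key_other_site = load("B", "123456789", "from site B")
repeat = load("A", "123456789", "daily reload of the same record")
```

The same source key from two different sites is accepted, while a repeat from the same site is rejected by the unique constraint, so no ranges need to be reserved per source.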

Designing a n:m-Table for a Name->Attribute relationship

I have two tables, let's say "Names" and "Addresses". Now I want to make a table that links idName to idAddress in an n:m relationship. For this table, there are multiple solutions:
1. Own Link-ID as Primary Key
2. Primary Key over the two id columns
3. Just an index over the two id columns (known as Heap-Table)
4. maybe even more?
Here is an SQLFiddle for my Example:
http://sqlfiddle.com/#!3/77665/7
Somehow I feel the artificial primary key is unneeded; on the other hand, it's good practice to have a single-column key for every table. So the question is: what approach would you recommend, and why? Or is there some other approach I'm missing?
I'd go with the unique IDENTITY column: this assures that the primary key of a new insert is always increasing. This is important as it means that inserts will always be at the end of the index and never somewhere in the middle of it. Inserting into the middle of an index can cause problems if the page it should go on is full: the page then has to be split into 2 (or more, depending on the actual index implementation). When that happens on an ordinary (non-clustered) index it's not generally a problem. When that happens to a clustered index, where the data is stored with the actual key, the problem is greater as it has to move around not just the keys but the data as well.
In most modern RDBMSs such page-splitting isn't a problem. In older versions I did see one index split require the complete re-org of a large table of users in an e-commerce web site (using SQL Server 2000, I think) which had a GUID as a primary key. This meant that it had to take out a table lock and that basically stopped the site: no-one could use it as the user table was locked. It took several hours for us to sort it out and the site was offline rather longer than its owners wanted it to be...

Reasons not to use an auto-incrementing number for a primary key

I'm currently working on someone else's database where the primary keys are generated via a lookup table which contains a list of table names and the last primary key used. A stored procedure increments this value and checks it is unique before returning it to the calling 'insert' SP.
What are the benefits for using a method like this (or just generating a GUID) instead of just using the Identity/Auto-number?
I'm not talking about primary keys that actually 'mean' something like ISBNs or product codes, just the unique identifiers.
Thanks.
An auto generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.
If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
There's nothing inherently wrong with using AutoNumber, but there are a few reasons not to do it. Still, rolling your own solution isn't the best idea, as dacracot mentioned. Let me explain.
The first reason not to use AutoNumber on each table is you may end up merging records from multiple tables. Say you have a Sales Order table and some other kind of order table, and you decide to pull out some common data and use multiple table inheritance. It's nice to have primary keys that are globally unique. This is similar to what bobwienholt said about merging databases, but it can happen within a database.
Second, other databases don't use this paradigm, and other paradigms such as Oracle's sequences are way better. Fortunately, it's possible to mimic Oracle sequences using SQL Server. One way to do this is to create a single AutoNumber table for your entire database, called MainSequence, or whatever. No other table in the database will use autonumber, but anyone that needs a primary key generated automatically will use MainSequence to get it. This way, you get all of the built in performance, locking, thread-safety, etc. that dacracot was talking about without having to build it yourself.
Another option is using GUIDs for primary keys, but I don't recommend that because even if you are sure a human (even a developer) is never going to read them, someone probably will, and it's hard. And more importantly, things implicitly cast to ints very easily in T-SQL but can have a lot of trouble implicitly casting to a GUID. Basically, they are inconvenient.
In building a new system, I'd recommend using a dedicated table for primary key generation (just like Oracle sequences). For an existing database, I wouldn't go out of my way to change it.
from CodingHorror:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
The article provides a lot of good external links on making the decision on GUID vs. Auto Increment. If I can, I go with GUID.
It's useful for clients to be able to pre-allocate a whole bunch of IDs to do a bulk insert without having to then update their local objects with the inserted IDs. Then there's the whole replication issue, as mentioned by Galwegian.
The procedure method of incrementing must be thread safe. If not, you may not get unique numbers. Also, it must be fast, otherwise it will be an application bottleneck. The built in functions have already taken these two factors into account.
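To illustrate the thread-safety point, here is a small sketch of a hand-rolled "table name -> last key used" generator like the one in the question. It is an in-memory stand-in, not the stored procedure itself; the `KeyTable` class and the `orders` table name are hypothetical.

```python
import threading


class KeyTable:
    """In-memory stand-in for the lookup table of last-used keys."""

    def __init__(self):
        self._last_key = {}
        self._lock = threading.Lock()

    def next_key(self, table):
        # Read, increment and write back under one lock; without it two
        # callers could read the same value and receive the same key.
        with self._lock:
            key = self._last_key.get(table, 0) + 1
            self._last_key[table] = key
            return key


keys = KeyTable()
results = [[] for _ in range(8)]


def worker(slot):
    # Each thread writes only to its own list, so no extra locking is needed here.
    for _ in range(1000):
        results[slot].append(keys.next_key("orders"))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

issued = sorted(k for chunk in results for k in chunk)
```

Every caller waits on the same lock, which is the bottleneck the answer warns about; built-in identity columns and sequences handle this inside the engine.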
My main issue with auto-incrementing keys is that they lack any meaning
That's a requirement of a primary key, in my mind -- to have no other reason to exist other than identifying a record. If it has no real-world meaning, then it has no real-world reason to change. You don't want primary keys to change, generally speaking, because you have to search-replace your whole database or worse. I have been surprised at the sorts of things I have assumed would be unique and unchanging that have not turned out to be years later.
Here's the thing with auto incrementing integers as keys:
You HAVE to have posted the record before you get access to it. That means that until you have posted the record, you cannot, for example, prepare related records that will be stored in another table, or any one of a lot of other possible reasons why it might be useful to have access to the new record's unique id, before posting it.
The above is my deciding factor, whether to go with one method, or the other.
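As a sketch of the alternative, client-generated keys let you build related records before anything is posted. The dictionaries below are a hypothetical in-memory shape for an order and its lines, not any particular ORM's API.

```python
import uuid


def new_order(customer, items):
    """Build an order and its lines with final keys before any insert."""
    order_id = uuid.uuid4()
    order = {"id": order_id, "customer": customer}
    lines = [
        {"id": uuid.uuid4(), "order_id": order_id, "item": item}
        for item in items
    ]
    return order, lines


order, lines = new_order("ACME", ["widget", "gadget"])
```

Both the parent and the child rows already carry their permanent keys, so they can be inserted in one batch without reading anything back from the server.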
Using unique identifiers would allow you to merge data from two different databases.
Maybe you have an application that collects data in multiple databases and then "syncs" with a master database at various times in the day. You wouldn't have to worry about primary key collisions in this scenario.
Or, possibly, you might want to know what a record's ID will be before you actually create it.
One benefit is that it can allow the database/SQL to be more cross-platform. The SQL can be exactly the same on SQL Server, Oracle, etc...
The only reason I can think of is that the code was written before sequences were invented and the code forgot to catch up ;)
I would prefer to use a GUID for most of the scenarios in which the post's current method makes any sense to me (replication being a possible one). If replication was the issue, such a stored procedure would have to be aware of the other server which would have to be linked to ensure key uniqueness, which would make it very brittle and probably a poor way of doing this.
One situation where I use integer primary keys that are NOT auto-incrementing identities is the case of rarely-changed lookup tables that enforce foreign key constraints, that will have a corresponding enum in the data-consuming application. In that scenario, I want to ensure the enum mapping will be correct between development and deployment, especially if there will be multiple prod servers.
Another potential reason is that you deliberately want random keys. This can be desirable if, say, you don't want nosey browsers leafing through every item you have in the database, but it's not critical enough to warrant actual authentication security measures.
My main issue with auto-incrementing keys is that they lack any meaning.
For tables where certain fields provide uniqueness (whether alone or in combination with another), I'd opt for using that instead.
A useful side benefit of using a GUID primary key instead of an auto-incrementing one is that you can assign the PK value for a new row on the client side (in fact you have to do this in a replication scenario), sparing you the hassle of retrieving the PK of the row you just added on the server.
One of the downsides of a GUID PK is that joins on a GUID field are slower (unless this has changed recently). Another upside of using GUIDs is that it's fun to try and explain to a non-technical manager why a GUID collision is rather unlikely.
Galwegian's answer is not necessarily true.
With MySQL you can set a key offset for each database instance. If you combine this with a large enough increment it will work fine. I'm sure other vendors have similar settings.
Let's say we have 2 databases we want to replicate. We can set it up in the following way.
increment = 2
db1 - offset = 1
db2 - offset = 2
This means that
db1 will have keys 1, 3, 5, 7....
db2 will have keys 2, 4, 6, 8....
Therefore we will not have key clashes on inserts.
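The arithmetic above can be checked with a short simulation. In MySQL the corresponding server variables are `auto_increment_increment` and `auto_increment_offset`; the Python below only imitates the resulting key streams, and `key_stream` is a hypothetical helper name.

```python
from itertools import count, islice


def key_stream(offset, increment):
    """Keys an instance would hand out: offset, offset + increment, ..."""
    return count(start=offset, step=increment)


db1 = list(islice(key_stream(offset=1, increment=2), 4))
db2 = list(islice(key_stream(offset=2, increment=2), 4))
```

With an increment equal to the number of instances and a distinct offset per instance, the two streams never overlap. Choosing a larger increment than currently needed leaves room to add instances later.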
The only real reason to do this is to be database agnostic (if different db versions use different auto-numbering techniques).
The other issue mentioned here is the ability to create records in multiple places (like in the central office as well as on traveling users' laptops). In that case, though, you would probably need something like a "sitecode" that was unique to each install that was prefixed to each ID.

Advantages and disadvantages of GUID / UUID database keys

I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.
Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?
Advantages:
Can generate them offline.
Makes replication trivial (as opposed to ints, which make it REALLY hard)
ORMs usually like them
Unique across applications, so we can use the PKs from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.
Disadvantages:
Larger space use, but space is cheap(er)
Can't order by ID to get the insert order.
Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
Harder to do manual debugging, but not that hard.
Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.
I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon wherever I've been working. We DO use the WordPress-like system though:
unique ID for the row (GUID/whatever). Never visible to the user.
public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)
UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.
If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).
So if you need insert performance, maybe use an auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).
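A rough sketch of that split: an auto-incrementing integer as the internal key, plus a separate GUID that is the only identifier shown outside. SQLite is assumed here, and the `articles` table, the `public_id` column, and the helper names are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE articles (
        id        INTEGER PRIMARY KEY,   -- internal key, inserts go at the end
        public_id TEXT NOT NULL UNIQUE,  -- GUID exposed in URLs
        title     TEXT NOT NULL
    )
    """
)


def add_article(title):
    """Insert an article and return the GUID that may be shown to users."""
    public_id = str(uuid.uuid4())
    with conn:
        conn.execute(
            "INSERT INTO articles (public_id, title) VALUES (?, ?)",
            (public_id, title),
        )
    return public_id


def find_by_public_id(public_id):
    """Resolve an external GUID to the internal row, or None if unknown."""
    return conn.execute(
        "SELECT id, title FROM articles WHERE public_id = ?", (public_id,)
    ).fetchone()


first_public_id = add_article("Hello world")
```

Joins and foreign keys use the small integer, while the GUID lives in a secondary unique index.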
Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(
@Matt Sheppard:
Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).
So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.
Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.
Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashions in general-purpose programming languages and (especially) middle-tier frameworks change all the time, and they will generally not outlive your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).
In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.
The main advantages are that you can create unique ids without connecting to the database, and that ids are globally unique, so you can easily combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.
The main disadvantages are a bit more storage needed (not a problem on modern systems) and that the ids are not really human readable. This can be a problem when debugging.
There are some performance problems like index fragmentation. But those are easily solvable (COMB GUIDs by Jimmy Nilsson: http://www.informit.com/articles/article.aspx?p=25862 )
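A simplified sketch of the COMB idea: keep most of the GUID random, but fill the trailing bytes from the clock so that values generated close together sort close together. The byte layout and the `comb_guid` name are assumptions for illustration, not the exact algorithm from the linked article.

```python
import os
import time
import uuid


def comb_guid(timestamp_ms=None):
    """Return a 16-byte id: 10 random bytes followed by a 6-byte timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    random_part = os.urandom(10)
    time_part = timestamp_ms.to_bytes(6, "big")  # milliseconds, big-endian
    # Note: the UUID version/variant bits are not set in this sketch.
    return uuid.UUID(bytes=random_part + time_part)


earlier = comb_guid(1_000)
later = comb_guid(2_000)
```

Whether this helps depends on how the database orders the type; the timestamp is placed last here on the assumption that the trailing bytes are the most significant for sorting, as with SQL Server's uniqueidentifier.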
Edit: merged my two answers to this question
@Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said, it is easily solved by adding meaningful unique constraints to non-key columns. The alternative is to use a natural key, and those have real problems.
GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).
One other small issue to consider with using GUIDs as primary keys arises if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because a GUID is not sequential in any way, so there will be page splits, etc. when you insert. Just something to consider if the system is going to have high IO...
primary-keys-ids-versus-guids
The Cost of GUIDs as Primary Keys (SQL Server 2000)
Myths, GUID vs. Autoincrement (MySQL 5)
This is really what you want.
UUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
There is one thing that is not really addressed, namely that using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.
RDBMSs usually ensure the uniqueness of the primary keys, and support lookups by key, in a structure called a B-tree, which is a search tree with a large branching factor (a binary search tree has a branching factor of 2). Now, a sequential integer ID would cause the inserts to occur on just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.
Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.
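The difference in insert locality can be seen with a toy model. This is not a B-tree, just a sorted Python list standing in for the ordered index, and `append_ratio` is a hypothetical helper that measures how often a new key lands at the very end.

```python
import bisect
import uuid


def append_ratio(keys):
    """Fraction of inserts that land at the end of the sorted index."""
    index = []
    appended = 0
    for key in keys:
        pos = bisect.bisect_left(index, key)
        if pos == len(index):
            appended += 1
        index.insert(pos, key)
    return appended / len(keys)


sequential_ratio = append_ratio(list(range(2000)))
random_ratio = append_ratio([uuid.uuid4().int for _ in range(2000)])
```

Sequential keys always append, so only the rightmost part of the index is touched. Random keys almost never append, so inserts are spread across the whole key range and more of the index has to stay in memory.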
Advantages:
UUID values are unique across tables and databases. That's why rows can be merged between two databases or across distributed databases.
A UUID is safer to pass through a URL than integer data.
If one passes a UUID through a URL, attackers can't guess the next id. But if we pass an integer such as 10, attackers can guess that the next id is 11, then 12, etc.
UUIDs can be generated offline.
One thing not mentioned so far: UUIDs make it much harder to profile data
For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both
provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)
provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)
From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. This matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.
For example, I am a company that offers customers a permissions gated portal. The address is portal.com/profile/{customerId}. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying for lastKnownCustomerCount + 1 regularly, and checking if the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).
UUIDs' non-sequential nature mitigates these issues. This isn't guaranteed to prevent profiling, but it's a start.
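The 404/403 probing described above can be sketched with a toy portal. Everything here is hypothetical: `status` imitates the portal's responses and `count_by_probing` imitates an outsider walking the ids upward from 1.

```python
import uuid


def status(store, customer_id, viewer_id):
    """Toy portal response: 404 if unknown, 403 if it exists but isn't yours."""
    if customer_id not in store:
        return 404
    return 200 if customer_id == viewer_id else 403


def count_by_probing(store, viewer_id):
    """Walk integer ids upward until the first 404; return how many exist."""
    n = 1
    while status(store, n, viewer_id) != 404:
        n += 1
    return n - 1


int_store = set(range(1, 51))                  # 50 customers with integer ids
uuid_store = {uuid.uuid4() for _ in range(50)}  # 50 customers with UUID ids

leaked_count = count_by_probing(int_store, viewer_id=1)
uuid_probe_count = count_by_probing(uuid_store, viewer_id=None)
```

With integer ids the outsider recovers the customer count exactly; with UUIDs the same walk learns nothing. Returning the same status for "missing" and "forbidden" would close the leak for integer ids as well.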

Tables with no Primary Key

I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
@Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is not a true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high concurrency contention. Basically, what happens is most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require an exclusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isn't too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using DBCC DBREINDEX. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question - then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one that's there.
Yet another huge benefit for using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You don't have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
document the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides a uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
Nobody answered the actual question: what are the pluses/minuses of a table with NO PK and NO CLUSTERED index?
In my opinion, if you optimize for faster inserts (especially incremental bulk-insert, e.g. when you bulk load data into a non-empty table), such a table, with NO clustered index, NO constraints, NO Foreign Keys, NO Defaults and NO Primary Key, in a database with Simple Recovery Model, is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety) you may want to add non-clustered, non-unique indexes as needed, but keep them to a minimum.
I too have always heard having an auto-incrementing int is good for performance even if you don't actually use it.
A Primary Key needn't be an autoincrementing field, in many cases this just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, you are correct that identities are something to steer clear of. I would make your GUID a primary key but nonclustered, since you can't use newsequentialid. That strikes me as your best course. If you don't make it a PK but put a unique index on it, sooner or later that may cause people who maintain the system to misunderstand the FK relationships, introducing bugs.
