PK generation performance: DB generated sequence or app-generated random - database

I am trying to understand if it's more performant for a database to manage id/sequence generation for use as a PK or to have an application simply generate it's own ID as, say a random 64-bit number.
My suspicion is that application + random generation will be more performant because there is less concurrency to manage, but then again, if the DB generates a sequence then it always knows that what comes next is ordered and can optimize index maintenance.

The idea behind PK is that they are unique by design. Using a 64 bit random number is not unique by design - so you have to prepare your code for PK collision, then randomizing a new id, and submit the new row again to the database. It can be quite a hassle. Having writers inserting new data in concurrency will eventually make these problem even worse.
The database can generate unique ids nicely - after all, the PK problem is pretty standard stuff, and there are good standard solutions for that.
The ID generator itself is a bottleneck - since it has to ensure uniqueness even if it is called from multiple sessions in parallel. For example, Oracle has sequence generators, which pre-generate and "cache" upcoming numbers per session, this way addressing the concurrency issue.
One important thing we can learn from Mr.Tom Kyte: measure your use case using realistic data in the tables and realistic load. This is the only source of true wisdom :)
My two cents: database engines differ a lot! Sharing what database you're using helps getting better answers.

Related

Removing PAGELATCH with randomized ID instead of GUID

We have two tables which receive 1 million+ insertions per minute. This table is heavily indexed, which also can’t be removed to support business requirements. Due to such high volume of insertions, we are seeing PAGELATCH_EX and PAGELATCH_SH. These locks further slowdown insertions.
A commonly accepted solution is to change the identity column to GUID so that insertions are written on random page every time. We can do this but changing IDs will trigger a need for the development cycle of migration scripts so that existing production data can be changed.
I tried another approach which seems to be working well in our load tests. Instead of changing to GUID, We are now generating IDs in a randomized pattern using following logic
SELECT #ModValue = (DATEPART(NANOSECOND, GETDATE()) % 14);
INSERT xxx(id)
SELECT NEXT VALUE FOR Sequence * (#ModValue + IIF(#ModValue IN (0,1,2,3,4,5,6), 100000000000,-100000000000))
It has eliminated PAGELATCH_EX and PAGELATCH_SH locks and our insertions are quite fast now. I also think GUID as PK of such a critical table is less efficient then a bigint ID column.
However, some of team members are sceptical on this as IDs with negative values that too generated on random basis is not a common solution. Also, there is argument that support team may struggle due to large negative IDs. A common habit of writing select * from table order by 1 will need to be changed.
I am wondering what the community’s take on this solution. If you could please point any disadvantage of approach suggested, that will be highly appreciated.
However, some of team members are skeptical on this as IDs with negative values that too generated on random basis is not a common solution
You have an uncommon problem, and so uncommon solutions might need to be entertained.
Also, there is argument that support team may struggle due to large negative IDs. A common habit of writing select * from table order by 1 will need to be changed.
Sure. The system as it exists now has a high (but not perfect) correlation between IDs and time. That is, in general a higher ID for a row means that it was created after one with a lower ID. So it's convenient to order by IDs as a stand-in for ordering by time. If that's something that they need to do (i.e. order the data by time), give them a way to do that in the new proposal. Conversely, play out the hypothetical scenario where you're explaining to your CTO why you didn't fix performance on this table for your end users. Would "so that our support personnel don't have to change the way they do things" be an acceptable answer? I know that it wouldn't be for me but maybe the needs of support outweigh the needs of end users in your system. Only you (and your CTO) can answer that question.

When adding a unique IDs to your table is there any advantage to making them random?

Let's say I had a table with N rows, but no existing columns that could act as a Primary Key.
I'd like to generate one (for my convenience and completeness).
I have a few options for doing this.
I could use a GUID
I could use a sequence and generate an integer for each one (e.g., populated 1 to N)
I could generate a random integer
(and many more)
I get that GUIDs have their advantages and disadvantages.
Is there some advantage to using a randomly generated integer over a sequential integer?
Any CRUD operations on an indexed column shouldn't be affected. And if you were doing a bulk load, I would temporarily turn off the index and then restore it afterwards
I can't see a reason, but I've come across a situation (in this case Oracle) where someone has done just that and I'm hoping its more than "What's a sequence?".
Since you're seeing a specific implementation that has chosen this approach, we can only speculate at what the original developer might have been thinking. That's always subject to error.
My guess is that the original developer was trying to avoid the issue where the right-most block in the index on the sequence-generated key becomes the resource that blocks many different sessions trying to do an insert. The "hot block" problem occurs because every session doing an insert needs to modify the data in the right-most block (assuming sequential keys) so Oracle needs to serialize access. In most systems, this isn't a big deal-- the amount of serialization needed is minimal and most systems don't have enough simultaneous insert operations for this to be a meaningful issue. But if you have a very high-volume system, particularly if you're running on a RAC cluster, those wait events can be meaningful. If you had this sort of issue, generating a random key would eliminate it by causing the various sessions to (generally) write to different blocks in the index.
Of course, generating random keys would not be the recommended approach even if you found yourself waiting on the right-most block of an index frequently. Oracle provides reverse-key indexes to take care of the hot block issue by indexing the data in reverse which distributes I/O across the blocks in the index. If you're licensed to use the partitioning option, a hash-partitioned index would be even better. For a more detailed discussion on reverse-key indexes, RAC, and mitigating hot block issues, here's a link to a related SO question.

GUID vs INT IDENTITY [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How do you like your primary keys?
I'm aware of the benefits of using a GUID, as well as the benefits of using and INT as a PK in a database. Considering that a GUID is in essence a 128 bit INT and a normal INT is 32 bit, the INT is a space saver (though this point is generally moot in most modern systems).
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
Kimberley Tripp (SQLSkills.com) has an article on using GUID's as primary keys. She advices against it because of the unnecessary overhead.
To answer your question:
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
I would use a GUID if my system would have an online/offline version that inside the offline version you can save data and that data is transferred back to the server one day during a synch. That way, you are sure that you won't have the same key twice inside your database.
We have Guids in our very complex enterprise software everywhere. Works smoothly.
I believe Guids are semantically more suitable to serve as identifiers. There is also no point in unnecessarily worrying about performance until you are faced with that problem. Beware premature optimization.
There is also an advantage with database migration of any sort. With Guids you will have no collisions. If you attempt to merge several DBs where ints are used for identity, you will have to replace their values. If these old values were used in urls, they will now be different following SEO hit.
Apart from being a poor choice when you need to synchronize several database instances, INT's have one drawback I haven't seen mentioned: inserts always occur at one end of the index tree. This increases lock contention when you have a table with a lot of movement (since the same index pages have to be modified by concurrent inserts, whereas GUID's will be inserted all over the index). The index may also have to be rebalanced more often if a B* tree or similar data structure is used.
Of course, int's are easier on the eye when doing manual queries and report construction, and space consumption may add up through FK usages.
I'd be interested to see any measurements of how well e.g. SQL Server actually handles insert-heavy tables with IDENTITY PK's.
the INT is a space saver (though this
point is generally moot in most modern
systems).
Not so. It may seem so at first glance, but note that the primary key of each table will be repeated multiple times throughout the database in indexes and as foreign key in other tables. And it will be involved in nearly any query containing its table - and very intensively when it's a foreign key used for a join.
Furthermore, remember that modern CPUs are very, very fast, but RAM speeds have not kept up. Cache behaviour becomes therefore increasingly important. And the best way to get good cache behaviour is to have smaller data sets. So the seemingly irrelevant difference between 4 and 16 bytes may well result in a noticeable difference in speed. Not necessarily always - but it's something to consider.
When comparing values such as Primary to Foreign key relationship, the INT will be faster. If the tables are indexed properly and the tables are small, you might not see much of a slow down, but you'd have to try it to be sure. INTs are also easier to read, and communicate with other people. It's a lot simpler to say, "Can you look at record 1234?" instead of "Can you look at record 031E9502-E283-4F87-9049-CE0E5C76B658?"
If you are planning on merging database at some stage, ie for a multi-site replication type setup, Guid's will save a lot of pain. But other than that I find Int's easier.
If the data lives in a single database (as most data for the applications that we write in general does), then I use an IDENTITY. It's easy, intended to be used that way, doesn't fragment the clustered index and is more than enough. You'll run out of room at 2 billion some records (~ 4 billion if you use negative values), but you'd be toast anyway if you had that many records in one table, and then you have a data warehousing problem.
If the data lives in multiple, independent databases or interfaces with a third-party service, then I'll use the GUID that was likely already generated. A good example would be a UserProfiles table in the database that maps users in Active Directory to their user profiles in the application via their objectGUID that Active Directory assigned to them.
Some OSes don't generate GUIDs anymore based on unique hardware features (CPUID,MAC) because it made tracing users to easy (privacy concerns). This means the GUID uniqueness is often no longer as universal as many people think.
If you use some auto-id function of your database, the database could in theory make absolutely sure that there is no duplication.
I always think PK's should be numeric where possble. Dont forget having GUIDs as a PK will probably mean that they are also used in other tables as foriegn keys, so paging and index etc will be greater.
An INT is certainly much easier to read when debugging, and much smaller.
I would, however, use a GUID or similar as a license key for a product. You know it's going to be unique, and you know that it's not going to be sequential.
I think the database also matters. From a MySQL perspective - generally, the smaller the datatype the faster the performance.
It seems to hold true for int vs GUID too -
http://kccoder.com/mysql/uuid-vs-int-insert-performance/
I would use GUID as PK only if this key bounds to similar value. For example, user id (users in WinNT are describes with GUIDs), or user group id.
Another one example. If you develop distributed system for documents management and different parts of system in different places all over the world can create some documents. In such case I would use GUID, because it guaranties that 2 documents created in different parts of distributed system wouldn't have same Id.

Advice on building a fast, distributed database

I'm currently working on a problem that involves querying a tremendous amount of data (billions of rows) and, being somewhat inexperienced with this type of thing, would love some clever advice.
The data/problem looks like this:
Each table has 2-5 key columns and 1 value column.
Every row has a unique combination of keys.
I need to be able to query by any subset of keys (i.e. key1='blah' and key4='bloo').
It would be nice to able to quickly insert new rows (updating the value if the row already exists) but I'd be satisfied if I could do this slowly.
Currently I have this implemented in MySQL running on a single machine with separate indexes defined on each key, one index across all keys (unique) and one index combining the first and last keys (which is currently the most common query I'm making, but that could easily change). Unfortunately, this is quite slow (and the indexes end up taking ~10x the disk space, which is not a huge problem).
I happen to have a bevy of fast computers at my disposal (~40), which makes the incredible slowness of this single-machine database all the more annoying. I want to take advantage of all this power to make this database fast. I've considered building a distributed hash table, but that would make it hard to query for only a subset of the keys. It seems that something like BigTable / HBase would be a decent solution but I'm not yet convinced that a simpler solution doesn't exist.
Thanks very much, any help would be greatly appreciated!
I'd suggest you listen to this podcast for some excellent information on distributed databases.
episode-109-ebays-architecture-principles-with-randy-shoup
To point out the obvious: you're probably disk bound.
At some point if you're doing randomish queries and your working set is sufficiently larger than RAM then you'll be limited by the small number of random IOPS a disk can do. You aren't going to be able to do better than a few tens of sub-queries per second per attached disk.
If you're up against that bottleneck, you might gain more by switching to an SSD, a larger RAID, or lots-of-RAM than you would by distributing the database among many computers (which would mostly just get you more of the last two resources)

How many database indexes is too many?

I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there's lots of SELECTs and very few changes, index all you like.... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs + DELETEs ... these will be very slow with lots of indexes since they all need to be modified each time one of these operations takes place
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with all any optimization, I stop when the requested performance is reached (this obviously implies that point 0. would be getting specific performance requirements).
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't end up getting used. One strategy that lets you find indexes that aren't used is to monitor index usage. You do this as follows:-
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes on if the queries against the table would benefit.
In a paraphrase of Einstein about simplicity, add as many indexes as you need and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and real MySql database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real
project and real MySql database.
My results are: adding average index
(1-3 columns in an index) to a table -
makes inserts slower by 2.1%. So, if
you add 20 indexes, your inserts will
be slower by 40-50%. But your selects
will be 10-100 times faster.
So is it ok to add many indexes? - It
depends :) I gave you my results - You
decide!
Ultimately how many indexes you need depend on the behavior of your applications that ride on top of your database server.
In general the more inserting you do the more painful your indexes become. Each time you do an insert, all the indexes that include that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the oposite in that queries are done over unristricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reogranizing your DB to include calculated search fields, splitting tables, etc - it's really dependant on your load shapes and query parameters, how much/what data 'really' needs to be retruend by a query.
If your entire DB is fronted by stored-procedure facades turning becomes a bit easier, as you don't have to wory about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning advisor usefull - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'offical' or third party.
This really is a more theoretical questions than practical. Indexes impact on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard Oracle announced a dedicated storage, made by HP, which is supposed to perform 10 times faster with 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that applies your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to IOT (index organized table), if your data support this.
Of course there might be many more solutions for such case. My first suggestion to you, would be to clone the DB to a development environment, and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to spped up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the amount of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criteria will cost you more to rebuild than the performance penalty of not having it in the odd case it got used.
Sql server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
It is totally based on the columns which are being used in Where Clause.
And as the Thumb of Rule, we must have indexes on Foreign Key Columns to avoid DEADLOCKS.
AWR report should analyze periodically to understand the need of indexes.

Resources