I am wondering about the performance impact of using a non-sequential UUID as the primary key in a table that will become quite large in PostgreSQL.
In DBMSs that use clustered storage for table records, it is a given that using a UUID will increase the cost of inserts, because once the table is too big to hold in memory, the engine has to read from disk to find the data page into which to perform the insert. As I understand it, Postgres does not maintain row clustering on inserts, so I imagine that in Postgres using a UUID PK does not hurt the performance of that insert.
But I would think that it makes the insert into the index that the primary key constraint creates much more expensive once the table is large, because index pages will constantly have to be read from disk to accommodate new entries. With a sequential key, by contrast, the index is only updated at the tip, which will always be in memory.
Assuming that I understand the performance impact on the index correctly, is there any way to remedy that or are UUIDs simply not a good PK on a large, un-partitioned table?
As I understand it, Postgres does not maintain row clustering on inserts
Correct at the moment. Unfortunately.
so I imagine that in Postgres using a UUID PK does not hurt the performance of that insert.
It still does have a performance cost because of the need to maintain the PK, and because the inserted tuple is bigger.
A UUID is 4 times as wide as a typical 32-bit integer synthetic key, so the row to write is 12 bytes bigger and you can fit fewer rows into a given amount of RAM.
The b-tree index that implements the primary key will be 4x as large (vs a 32-bit key), taking longer to search and requiring more memory to cache. It also needs more frequent page splits.
Writes will tend to be random within indexes, not appends to hot, recently accessed rows.
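To make the size difference concrete, here is a minimal sketch with hypothetical table names (gen_random_uuid() is built in from PostgreSQL 13; older versions need the pgcrypto extension) that loads the same rows under a uuid key and a 32-bit serial key and compares the primary-key index sizes:

```sql
-- Hypothetical tables: same payload, different key types.
CREATE TABLE events_uuid (
    id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- 16-byte key
    body text
);

CREATE TABLE events_seq (
    id   serial PRIMARY KEY,                          -- 4-byte key
    body text
);

INSERT INTO events_uuid (body) SELECT 'x' FROM generate_series(1, 1000000);
INSERT INTO events_seq  (body) SELECT 'x' FROM generate_series(1, 1000000);

-- Compare the b-tree sizes of the two primary-key indexes.
SELECT pg_size_pretty(pg_relation_size('events_uuid_pkey')) AS uuid_pk,
       pg_size_pretty(pg_relation_size('events_seq_pkey'))  AS seq_pk;
```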
is there any way to remedy [the performance impact on the index] or are UUIDs simply not a good PK on a large, un-partitioned table?
If you need a UUID key, you need a UUID key. You shouldn't use one if you don't require one, but if you cannot rely on a central source of synthetic keys and there is no suitable natural key to use, it's still the way to go.
Partitioning won't help much unless you can confine writes to one partition. Also, even if you only write to one partition at a time, you won't be able to usefully use constraint exclusion on searches for the key, so you'll still have to search every partition's index for a key when doing queries. I can only see it being useful if your UUID forms part of a composite key and you can partition on the other part of the composite key.
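As a rough illustration of that composite-key case, here is a sketch using declarative partitioning (PostgreSQL 11+), with a hypothetical tenant_id as the non-UUID part of the key; queries that supply tenant_id can then be pruned to a single partition:

```sql
-- Hypothetical: partition on tenant_id; the UUID is only one part of the composite key.
CREATE TABLE documents (
    tenant_id int  NOT NULL,
    doc_id    uuid NOT NULL,
    payload   text,
    PRIMARY KEY (tenant_id, doc_id)
) PARTITION BY LIST (tenant_id);

CREATE TABLE documents_t1 PARTITION OF documents FOR VALUES IN (1);
CREATE TABLE documents_t2 PARTITION OF documents FOR VALUES IN (2);

-- Prunable: the planner can skip every partition except tenant 1's.
SELECT * FROM documents
WHERE tenant_id = 1 AND doc_id = '00000000-0000-0000-0000-000000000000';
```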
It should be mentioned that you will get more WAL generated if you have a btree index on a UUID column and the full_page_writes option is enabled. This happens because of UUID randomness: the values are not sequential, so each insert is likely to touch a completely new index leaf page. You can read more in the article On the impact of full-page writes.
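One hedged way to observe this (reusing the hypothetical events_uuid table from the earlier sketch) is to note the WAL position before and after a batch of inserts and diff the two; comparing runs against a sequentially keyed table shows the extra WAL that the random index writes generate:

```sql
-- PostgreSQL 10+ function names; older versions use pg_current_xlog_location().
SELECT pg_current_wal_lsn();   -- note this LSN before the batch

INSERT INTO events_uuid (body) SELECT 'x' FROM generate_series(1, 100000);

SELECT pg_current_wal_lsn();   -- note this LSN after the batch

-- Bytes of WAL written between the two positions (substitute the LSNs you recorded):
SELECT pg_size_pretty(pg_wal_lsn_diff('0/3000000'::pg_lsn, '0/2000000'::pg_lsn));
```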
Related
I have just taken over a database which has around 2200 tables. Over 2000 of these have no clustered index (some have no indexes at all).
All of the tables have been configured to use a GUID as the uniqueidentifier.
Just looking at the query plans, I can see that there are many table scans occurring. Most searches use the uniqueidentifier to search on.
I am wondering if it is better to have a clustered index on the GUID than not to have a clustered index at all. I imagine that a clustered index on a 16-byte column will inevitably lead to fragmentation.
I could arguably cluster on other columns but the majority of searches tend to search by or join via the GUIDS.
Any advice would be very much welcomed. I've never seen so many GUIDs!
In general, I would recommend having an identity column as the primary key and using that for clustering. This is also a better choice for joins.
Why? First, identity keys are generally shorter than unique ids. So, foreign key references and indexes are smaller.
More importantly, inserts would always go at the "end" of the table. When using GUIDs, inserts are often going to cause fragmentation. If you are inserting rows, I would say that a secondary index on the GUID might be better than a clustered index (the fragmentation is only in the index).
With 2000 tables, I doubt you will change the structure. You can ameliorate the fragmentation using newsequentialid().
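A minimal sketch of that layout, with hypothetical names: an identity column as the clustered PK, the GUID kept for existing lookups behind a unique nonclustered index, and NEWSEQUENTIALID() as the default where the GUID must still be generated inside the database.

```sql
-- Hypothetical table: the identity drives the clustered index; the GUID stays searchable.
CREATE TABLE dbo.Customer (
    CustomerId   int IDENTITY(1,1)  NOT NULL,
    CustomerGuid uniqueidentifier   NOT NULL
        CONSTRAINT DF_Customer_Guid DEFAULT NEWSEQUENTIALID(),  -- sequential, unlike NEWID()
    Name         nvarchar(100)      NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId)
);

CREATE UNIQUE NONCLUSTERED INDEX UX_Customer_Guid
    ON dbo.Customer (CustomerGuid);
```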
A GUID column with random values is usually not the best choice for a clustered index, because it can be the root cause of index fragmentation:
The database's read-ahead won't be effective;
Insert operations will be expensive, because you'll get lots of page-split overhead.
There are three ways you can live with that:
Schedule regular index reorganizing and rebuilding, which will reduce index fragmentation and keep your statistics up to date;
Use NEWSEQUENTIALID() to generate values for this column;
Generate GUID values sequentially outside of the database (the Guid.Comb identifier in NHibernate is a great example of solving this issue).
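For the first option, a hedged sketch of the maintenance step (the thresholds are the commonly quoted rule of thumb, not hard rules, and the table and index names are hypothetical):

```sql
-- Find fragmented indexes in the current database.
SELECT OBJECT_NAME(ips.object_id)          AS table_name,
       i.name                              AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 10;

-- Light fragmentation: reorganize.  Heavy fragmentation: rebuild.
ALTER INDEX UX_Customer_Guid ON dbo.Customer REORGANIZE;
ALTER INDEX UX_Customer_Guid ON dbo.Customer REBUILD WITH (FILLFACTOR = 90);
```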
This is really a comment on your question, in addition to Gordon's good answer:
Firstly, don't forget to check the index DMVs to see which ones are being used (or not used), and have a look at the expensive query plans in the cache to focus on the tables and queries that will be causing the most pain. I would expect that many of those 2200 tables are relatively small and that queries can look rows up pretty quickly even from the GUID clustered index.
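A hedged example of that DMV check (the counters reset on instance restart, so treat them as indicative only):

```sql
-- Index read vs. write activity for the current database since the last restart.
SELECT OBJECT_NAME(s.object_id)                      AS table_name,
       i.name                                        AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups  AS reads,
       s.user_updates                                AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY reads DESC;
```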
For those tables that aren't clustered, clustering on the GUID would reduce fragmentation, since it forces all the data for the table to be colocated rather than allowing pages to be put in the next free extent, spreading the table all over the disk. This should make some of the I/O more efficient.
Check that you have a low enough fill factor so that your regular index rebuilds avoid page splitting in advance, although the right value is also workload dependent (OLTP vs. DW, and the read/write ratio of the table).
If you have applications that are doing explicit column selects/inserts, then you may be able to add an identity column without breaking anything. That allows you to cluster around the identity and add an index on the GUID. Whether this really helps depends on the relative (in)efficiency of the new plans.
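A sketch of that retrofit for one hypothetical heap table (the names are made up; the GUID column keeps its own index for the existing lookups and joins):

```sql
-- Add a narrow, ever-increasing column, cluster on it, and keep the GUID searchable.
ALTER TABLE dbo.Orders ADD OrderSeq int IDENTITY(1,1) NOT NULL;

CREATE UNIQUE CLUSTERED INDEX CX_Orders_OrderSeq
    ON dbo.Orders (OrderSeq);

CREATE NONCLUSTERED INDEX IX_Orders_OrderGuid
    ON dbo.Orders (OrderGuid);
```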
You could consider clustering around a non-GUID field that queries will look up against fairly regularly (e.g., a date range) and indexing the GUID separately.
You'd have to look at the queries and relative performance for that more closely.
Are there any considerations for defining keys on a table that already has a lot of records and where most of the operations performed on it are inserts?
Key definition ultimately comes down to how you can uniquely and efficiently identify any specific row in a table. If a business key value fulfills that requirement, then it is a suitable candidate. An ideal key is also skinny. A GUID is horrible for this (IMHO) because it is far larger than it needs to be.
If insert performance is the most important priority and a suitable business key is not available, you can use an integer-based identity key. If you expect more than 2.1 billion records within a few years, use bigint (about 9 quintillion records) instead.
Keep in mind that every nonclustered index you make on the table will include the clustering key, which is typically the PK. Having a skinny PK can make your indexes more efficient, using less storage, memory and CPU.
Insert speed is affected by the clustered index sort order as well as the number and sort order of all non-clustered indexes on the table. Column-store indexes are not sorted and have minimal overhead on inserts.
A PK that stores an ID number (a business identifier) is heavier than an auto-incrementing number, so when you define the key, keep in mind that it is better to create a separate auto-incrementing column to serve as the PK.
Is there any difference in performance (in terms of inserting/updating & querying) a table if the primary key is a single column (e.g., a GUID generated for every row) or multiple columns (e.g., a foreign key GUID + an offset number)?
I would assume that, if anything, querying should be quicker with multi-column primary keys; however, I imagine inserting would be slower due to a slightly more complicated uniqueness check. I also imagine the data types of a multi-column primary key could matter (e.g., if one of the columns were a DateTime type it would add complexity). These are just my thoughts to provoke answers and discussion (hopefully!) and are not fact based.
I realise there are some other questions covering this topic, but I'm wondering about performance impacts rather than management/business concerns.
You will be affected more by (each) component of the key being (a) variable length and (b) wide [wide instead of narrow columns] than by the number of components in the key. Unless MS have broken it again in the latest release (they broke heaps in 2005). Datatype does not slow it down; the width, and particularly variable length (any datatype), does. Note that a fixed-length column is made variable if it is set to Nullable. Variable-length columns in indices are bad news, because a bit of "unpacking" has to be performed on every access to get at the data.
Obviously, keep indexed columns as narrow as possible, using fixed, and not Nullable columns only.
In terms of the number of columns in a compound key, sure, one column is faster than seven, but not that much: three fat, wide, variable columns are much slower than seven thin, fixed columns.
GUID is of course a very fat key; GUID plus anything else is very, very fat; a Nullable GUID is Guinness material. Unfortunately it is the knee-jerk reaction to solving the IDENTITY problem, which in turn is a consequence of not having chosen good natural relational keys. So you are best advised to fix the real problem at the source and choose good natural keys; avoid IDENTITY; avoid GUID.
Experience and performance tuning, not conjecture.
It depends on your access patterns, read/write ratio and whether (possibly most importantly) the clustered index is defined on the Primary Key.
Rule of thumb is make your primary key as small as possible (32 bit int) and define the clustered index on a monotonically increasing key (think IDENTITY) where possible, unless you have range searches that form a large proportion of the queries against that table.
If your application is write-intensive and you define the clustered index on the GUID column, you should note:
All non-clustered indexes will contain the clustered index key and will therefore be larger. This may have a negative effect on performance if there are many NC indexes.
Unless you are using an 'ordered' GUID (such as a COMB or one generated with NEWSEQUENTIALID()), your inserts will fragment the index over time. This means you need regular index rebuilds and possibly need to increase the amount of free space left in pages (fill factor).
Because there are many factors at work (hardware, access patterns, data size), I suggest you run some tests and benchmark your particular circumstances.
It depends on the indexing and storage in each case. All other things being equal, the choice of primary key is irrelevant as far as performance is concerned. The choice of indexes and other storage options would be the deciding factor.
If your situation is going to be geared towards a higher number of inserts, then the smaller the footprint, the better.
There are two things you need to separate, the concept of the primary key at the database level, and the concept of the key your application uses.
Why do you need a GUID? Are you going to be inserting into multiple database servers and then combining the information into one centralized database?
If that is the case, then my recommendation is an identity followed by a GUID: a clustered index on the identity, and a unique nonclustered index on the GUID. If you use the GUID as the clustered index, then your data inserts will be all over the place. Your data will not be inserted sequentially, and this causes performance problems as your system will be inserting and moving pages around randomly.
Having your data inserted nicely in an ordered fashion, thanks to the identity, is the way to go. You can leave the sorting to the index structure (the nonclustered unique index containing the GUID), which is a much more efficient structure to sort than the table data.
I'm not a DBA ("Good!", you'll be thinking in a moment.)
I have a table of logging data with these characteristics and usage patterns:
A datetime column for storing log timestamps whose value is ever-increasing and mostly (but only mostly) unique
Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
No updates at all
Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
Infrequent selects using other columns as the criteria (and not including the timestamp column)
A good amount of data, but nowhere near enough that I'm worried much about storage space
Additionally, there is currently a daily maintenance window during which I could do table optimization.
I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.
I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).
But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.
All of this suggests to me that I should:
Put a clustered index on the timestamp column with a 100% fill-factor
Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
Schedule the bulk deletes to occur during the daily maintenance interval
Schedule a rebuild of the clustered index to occur immediately after the bulk delete
Relax and get out more
Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?
Thanks in advance.
Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!
Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.
She mentions (about in the middle of the article):
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and optimally ever-increasing. This is best served with an INT IDENTITY column.
Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!
If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
I agree with putting the clustered index on the timestamp column. My query would be on the fill factor: 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fill factor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best for your situation.
After the bulk deletes it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fill factor.
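A sketch of that post-delete maintenance, assuming a hypothetical dbo.LogEntries table and index name, and whatever fill factor you settle on:

```sql
-- Rebuild resets the pages to the chosen fill factor and removes the gaps left by the bulk delete.
ALTER INDEX PK_LogEntries ON dbo.LogEntries
    REBUILD WITH (FILLFACTOR = 90);

-- Refresh statistics so the optimizer sees the post-delete distribution.
UPDATE STATISTICS dbo.LogEntries WITH FULLSCAN;
```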
Finally, yes, put nonclustered indexes on other appropriate columns, but only ones that are very selective, e.g. not bit fields. And remember: the more indexes, the more this affects write performance.
There are two "best practice" ways to index a high-traffic logging table:
an integer identity column as a primary clustered key
a uniqueidentifier column as primary key, with DEFAULT NEWSEQUENTIALID()
Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
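The two layouts, sketched with hypothetical column names; either way the clustering key is ever-increasing, so new log rows append at the end of the index:

```sql
-- Option 1: integer identity as the clustered primary key.
CREATE TABLE dbo.AppLog_Identity (
    LogId    bigint IDENTITY(1,1) NOT NULL,
    LoggedAt datetime2            NOT NULL,
    Message  nvarchar(4000)       NULL,
    CONSTRAINT PK_AppLog_Identity PRIMARY KEY CLUSTERED (LogId)
);

-- Option 2: uniqueidentifier primary key defaulting to NEWSEQUENTIALID().
CREATE TABLE dbo.AppLog_Guid (
    LogId    uniqueidentifier NOT NULL
        CONSTRAINT DF_AppLog_Guid DEFAULT NEWSEQUENTIALID(),
    LoggedAt datetime2        NOT NULL,
    Message  nvarchar(4000)   NULL,
    CONSTRAINT PK_AppLog_Guid PRIMARY KEY CLUSTERED (LogId)
);
```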
I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.
The obvious answer is that it depends on how you will query it. The point of the index is to lessen the quantity of compares when selecting data. The clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64k block with one read). If you include an ID and a datetime as the primary key but do not use them in your selection criteria, they will do nothing but hinder your performance. This is why people usually drop indexes before bulk-loading data.
Is it true that an UPDATE SQL query is slow because of a clustered index?
You would be better off saying 'slower' rather than 'slow'. When data is written to a clustered index and it doesn't go at the very end of the table, data needs to be juggled around in order to fit it in, in the same way that adding a CD to a big stack of alphabetised CDs is a lot slower than just sticking it on top.
If you don't have any clustered indexes at all, then what you have is termed a "heap". You also have a heap of trouble, since the order of the data in your table is random - and selecting data from the table will be slow. That may be OK if you're doing many more INSERTs than you are SELECTs, but usually that's not the case.
Whether the clustered index makes INSERTs slower or not depends on:
The fill factor of the table (i.e. whether there are enough gaps in the data to allow new data to be inserted without moving everything around).
What columns are chosen as the cluster key.
If you're using an identity column as the cluster key, then you may find that insert performance is perfectly fine, since new entries are always being added on the end. The same may apply to a datetime column if using the current date (which of course also keeps increasing).
You need to keep the size of the cluster key small, since that's the index into the data that's stored in every other index. For example, if your cluster key consists of 3 ints and a datetime, then each entry in all your other indexes will include all that data in addition to whatever it was that you tried to index. For this reason, an identity column is actually a pretty good choice of cluster key since it's nice & small.
The perfect cluster key in any situation can only be chosen with a good deal of thought and a lot of testing (with realistically large data sets). Having a good cluster key can make a huge difference to SELECT performance - which normally outweighs any degradation in INSERT performance.
Define slow; of course the clustered index will always be slower than a non-clustered index...
Insertion and updates are slower because of clustered indexes (particularly on huge tables) - but selects are way faster.
Making the index non-clustered usually improves insert and update performance while retaining select performance (selects are often less performant with a non-clustered index compared with a clustered index, but something's gotta give).
A clustered index dictates how a table is physically stored on disk, and so updating a table with a clustered index may require that significant parts of the table be moved to make space for the new record, and that's slow.
You can mitigate the problem by setting an appropriate fillfactor for your indexes. It's not quite so bad that you have to re-jigger the whole table when you add a record to the middle; it's usually just a few pages. Fillfactor determines how much of each page is filled before creating a new page, and how much to leave as wiggle room for new insertions. A lower fillfactor on an index will leave more space for new records and therefore give faster insert times on average, at the cost of more disk space and more pages and therefore slower reads. But if you're doing a lot more updating than reading it may be worth it.
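A short sketch of that mitigation (the table and column names are hypothetical, and the numbers are illustrative rather than recommendations):

```sql
-- Leave roughly 20% free space per leaf page so mid-index inserts have room before a page split.
CREATE CLUSTERED INDEX CX_Sales_CustomerId
    ON dbo.Sales (CustomerId)
    WITH (FILLFACTOR = 80, PAD_INDEX = ON);  -- PAD_INDEX applies the fill factor to intermediate pages too
```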