I have a SQL 2005 database I've inherited, with a table that has grown to about 17 million records over the course of about 15 years, and is now horribly slow.
The table layout looks about like this:
id_column = nvarchar(20),indexed, not unique
column2 = nvarchar(20), indexed, not unique
column3 = nvarchar(10), indexed, not unique
column4 = nvarchar(10), indexed, not unique
column5 = numeric(8,0), indexed, not unique
column6 = numeric(8,0), indexed, not unique
column7 = nvarchar(20), indexed, not unique
column8 = nvarchar(10), indexed, not unique
(and about 5 more columns that look pretty much the same, not indexed)
The 'id' field is a value entered in a front-end application by the end-user.
There are no defined primary keys, and no columns that can be combined to make a unique row (unless all columns are combined). The table actually is a 'details' table to another table, but there are no constraints ensuring referential integrity.
Every column is heavily used in 'where' clauses in queries, which is why I assume there's an index on every one, or perhaps a desperate attempt to speed things up by another DBA.
Having said all that, my question is: could adding a clustered index do me any good at this point?
If I did add a clustered index, I assume it would have to be a new column, ie., an identity column? Basically, is it worth the trouble?
Appreciate any advice.
I would say only add the clustered index if there is a reasoning for needing it. So ask these questions;
Does the order of the data make sense?
Is there sequential value to the way the data is inserted?
Do I need to use a feature that requires it have a clustered index, such as full text index?
If the answer to these questions is all "No" than a clustered index might not be of any additional help over a good non-clustered index strategy. Instead you might want to consider how and when you update statistics, when you refresh the indexes and whether or not filtered indexes make sense in your situation. Looking at the table you have as an example it tough to say, but maybe it makes sense to normalize the table further and use numeric keys instead of nvarchar.
http://www.mssqltips.com/sqlservertip/3041/when-sql-server-nonclustered-indexes-are-faster-than-clustered-indexes/
the article is a great example of when non-clustered indexes might make more sense.
I would recommend adding a clustered index, even if it's an identity column for 3 reasons:
Assuming that your existing queries have to go through the entire table every time, a clustered index scan is still faster than a table scan.
The table is a child to some other table. With some extra works, you can use the new child_id to join against the parent table. This enables clustered index seek, which is a lot faster than scan in some cases.
Depend on how they are setup, the existing indices may not do much good. I've come across some terrible indices that cover 1 column each, or indices that don't include the appropriate columns, causing costly Key Lookups operations. Check your index stats to see if they are being used at all.
Related
I have a table which has over 25 millions rows. The table gets bigger every day (roughly 35 000 rows). I created nonclustered index on 2 columns - date and debt_id (these columns are used most frequently in WHERE clouse), each debt_id occurs only once in each date). So the table is still the heap because it doesn't have a clustered index. Do you think that it would be a good idea to add identity column (1,1) and create clustered index on it? Or what do you think I should do to boost performance on this table?
If your two columns are unique in any case, you can use them as clustered index.
Most important: A clustered index should not change its values, and new rows should be appended in the correct order.
The time of insertion as DATETIME2 as the first column of your clustered index is a good bet here.
The uniqueness must be guaranteed by the combination of this value and the debt_id you've mentioned.
Assuming that neither the time of insertion nor the debt_id are changing data, this looks like a very good combined PK.
Otherwise your clustered index might get fragmented. This would make things even worse... (The main reason why UNIQUEIDENTIFIER IDs tend to be very bad as clustered PK. Regularly running index repair scripts can be an acceptable workaround.)
A non-fragmented clustered index will speed up things, as long as your query filters on both columns (at least the first one must be involved).
You can add more indexes, you might even INCLUDE heavily needed values to them.
Other indexes will use the clustered index as lookup (might need recreation after building the clustered one). This helps if the clustered index is well performing and can make things worse if not.
So I'd say: If the above is true in your case, an additional ID IDENTITY is of little help. This will add one more step to each query, as the Query will need an additional lookup. But, if the index is prone to fragmentation, I'd rather add the additional ID. And finally, to cite George Menoutis in comments
Well, I certainly can't answer this; it is a deep design choice with
loads of pros, loads of cons, and loads of discussion
Without knowing your database and your needs this is pure guessing...
In my system I have temporary entities that are created based on rules stored in my database, and the entities are not persisted.
Now, I need is to store information about these entities, and because they are created based on rules and are not stored, they have no ID.
I came up with a formula to generate an ID for these temp entities based on the rule that was used to generate them: id = rule id + "-" + entity index in the rule. This formula generates unique strings of the form 164-3, 123-0, 432-2, etc...
My question is how should I build my table (regarding primary key and clustered index) when my keys have no relation or order? Keep in mind that I will only (99.9% of the time) query the table using the id mentioned above.
Options I thought about after much reading, but don't have the knowledge to determine which is better:
1) primary key on a varchar column with clustered index. -According to various sources, this would be bad because of fragmentation and the wideness of the key. Also their format is pretty weird for sorting.
2) primary key on varchar column without clustered index (heap table). -Also a bad idea according to various sources due to indexing and fragmentation issues.
3) identity int column with clustered index, and a varchar column as primary key with unique index. -Can't really see the benefit of the surogate key here since it would mainly help with range queries and ordering and I would never query the table based on this key because it would be unknown at all times.
4) 2 columns composite key: rule id + rule index columns.
-Now I don't have strings but I have two columns that will be copied to FKs and non clustered indexes. Also I'm not sure what indexes I would use in this case.
Can anybody shine a light here? Any help is appreciated.
--Edit
I will perform more selects than inserts;
I will perform more inserts than updates;
All selects will include at least rule id;
If I use a surogate primary key, and a unique index on (rule id, index), then I can use the surogate for subsequent operations after retrieving data by rule id, which would be faster. Also, inserts would be faster.
However, because the data will be stored according to the surogate key, I might have records that have the same rule id, but different index, stored quite far from each other on disk, which means even with an index on rule id, retrieving the data could be kinda slow.
If I use (rule id, index) as clustered primary key, rows with same rule id would be stored close to each other, and selecting data by rule id would be efficient enough. However, I suspect inserts would be slow.
Is the rationale above correct?
Using a heap is generally a bad idea unless proven otherwise. Even so, you will need a very solid reason for not having a clustered index (any one will make things better, even on identity column).
Storing this key in a single column is okay; if you want natural sorting, you can pad your numbers with zeroes, for example. However, this will widen the key.
Having a composite primary key (and, subsequently, foreign keys) is completely acceptable, especially when dealing with natural keys, like the one you have. This will give you the narrowest possible key - int + int or some such - while eliminating the sorting issue at the same time. I would recommend to make this PK clustered to reduce additional key lookups.
Fragmentation here will not be a big issue; at least, no bigger than with any other indexing decision. Any index built on such a key will be prone to fragmentation, clustered or no. In any case, your DBA should know how to keep an index such as this in top form.
Regarding the order of columns in the index, the following rules usually apply:
If partial key match will take place (filtering by one part of the key but not by the other) the one which is used most often should go first;
If No.1 isn't applicable and all parts of the key used in all queries, the column with the highest cardinality should go first.
The order of remaining columns (if there are more than 1) isn't of much importance because SQL Server only creates distribution statistics for the first column in a composite index. However, it is a good idea to list them in order of decreasing cardinality.
EDIT: Seeing your update with additional details, here are the most suitable options. Suppose your table looks like this:
-- Sample table
create table dbo.TempEntities (
RuleId int not null,
IndexId int not null,
-- Remaining columns listed here
EntityData xml not null
);
go
From here, the most straightforward way is to use the natural key as a clustered index:
-- Option 1 - natural clustered index
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (RuleId, IndexId);
go
However, if you have any child tables that would reference this one, it might not be the most convenient solution, because natural keys are prone to updates, which creates a mess where you could avoid it. Instead, a surrogate key can be introduced, like this:
-- Option 2 - surrogate clustered, natural nonclustered
alter table dbo.TempEntities add Id bigint identity(1,1) not null;
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (Id);
alter table dbo.TempEntities
add constraint UQ_TempEntities_RuleIdIndexId unique (RuleId, IndexId);
go
It makes sense to have the surrogate PK clustered, because it will result in much less page splits, making inserts faster (despite having one index more compared to Option 1). Without any intimate knowledge of your queries, this is probably the most balanced solution.
Shuffling the clustered attribute between surrogate and natural keys has mostly academic value and can only make difference on a high-load system with hundreds of inserts happening every second on 24*7 schedule. If your system is indeed as such, please seek a professional consultant who will analyse your queries and provide the solution tailored to your situation.
This question already has answers here:
What are the differences between a clustered and a non-clustered index?
(13 answers)
Closed 7 years ago.
I need to add proper index to my tables and need some help.
I'm confused and need to clarify a few points:
Should I use index for non-int columns? Why/why not
I've read a lot about clustered and non-clustered index yet I still can't decide when to use one over the other. A good example would help me and a lot of other developers.
I know that I shouldn't use indexes for columns or tables that are often updated. What else should I be careful about and how can I know that it is all good before going to test phase?
A clustered index alters the way that the rows are stored. When you create a clustered index on a column (or a number of columns), SQL server sorts the table’s rows by that column(s). It is like a dictionary, where all words are sorted in alphabetical order in the entire book.
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. It creates a completely different object within the table that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data. It is like an index in the last pages of a book, where keywords are sorted and contain the page number to the material of the book for faster reference.
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way!
One rule of thumb I would apply is this: any "regular" table (one that you use to store data in, that is a lookup table etc.) should have a clustering key. There's really no point not to have a clustering key. Actually, contrary to common believe, having a clustering key actually speeds up all the common operations - even inserts and deletes (since the table organization is different and usually better than with a heap - a table without a clustering key).
Kimberly Tripp, the Queen of Indexing has a great many excellent articles on the topic of why to have a clustering key, and what kind of columns to best use as your clustering key. Since you only get one per table, it's of utmost importance to pick the right clustering key - and not just any clustering key.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc
You should be using indexes to help SQL server performance. Usually that implies that columns that are used to find rows in a table are indexed.
Clustered indexes makes SQL server order the rows on disk according to the index order. This implies that if you access data in the order of a clustered index, then the data will be present on disk in the correct order. However if the column(s) that have a clustered index is frequently changed, then the row(s) will move around on disk, causing overhead - which generally is not a good idea.
Having many indexes is not good either. They cost to maintain. So start out with the obvious ones, and then profile to see which ones you miss and would benefit from. You do not need them from start, they can be added later on.
Most column datatypes can be used when indexing, but it is better to have small columns indexed than large. Also it is common to create indexes on groups of columns (e.g. country + city + street).
Also you will not notice performance issues until you have quite a bit of data in your tables. And another thing to think about is that SQL server needs statistics to do its query optimizations the right way, so make sure that you do generate that.
A comparison of a non-clustered index with a clustered index with an example
As an example of a non-clustered index, let’s say that we have a non-clustered index on the EmployeeID column. A non-clustered index will store both the value of the
EmployeeID
AND a pointer to the row in the Employee table where that value is actually stored. But a clustered index, on the other hand, will actually store the row data for a particular EmployeeID – so if you are running a query that looks for an EmployeeID of 15, the data from other columns in the table like
EmployeeName, EmployeeAddress, etc
. will all actually be stored in the leaf node of the clustered index itself.
This means that with a non-clustered index extra work is required to follow that pointer to the row in the table to retrieve any other desired values, as opposed to a clustered index which can just access the row directly since it is being stored in the same order as the clustered index itself. So, reading from a clustered index is generally faster than reading from a non-clustered index.
In general, use an index on a column that's going to be used (a lot) to search the table, such as a primary key (which by default has a clustered index). For example, if you have the query (in pseudocode)
SELECT * FROM FOO WHERE FOO.BAR = 2
You might want to put an index on FOO.BAR. A clustered index should be used on a column that will be used for sorting. A clustered index is used to sort the rows on disk, so you can only have one per table. For example if you have the query
SELECT * FROM FOO ORDER BY FOO.BAR ASCENDING
You might want to consider a clustered index on FOO.BAR.
Probably the most important consideration is how much time your queries are taking. If a query doesn't take much time or isn't used very often, it may not be worth adding indexes. As always, profile first, then optimize. SQL Server Studio can give you suggestions on where to optimize, and MSDN has some information1 that you might find useful
faster to read than non cluster as data is physically storted in index order
we can create only one per table.(cluster index)
quicker for insert and update operation than a cluster index.
we can create n number of non cluster index.
I am having a very small tables with at most 5 records that holds some labels. I am using Postgres.
The structure is as follows:
id - smallint
label - varchar(100)
The table will be used mainly to reference the rows from other tables. The question is if it's really necessary to have a primary key on id or to have just an index on the id or have them both?
I did read about indexes and primary keys and I understand that this depends quite a lot on what's the table going to be used for:
Tables with no Primary Key
Edit: I was going to ask about having a primary key or an index or have them both. I edited the question.
It is always good practice to have a primary key column. The typical scenario it is needed is when you want to update or delete a row, having a PK makes it much easier and safer.
Yes, a primary key is not only good practice -- it's crucial. A table that lacks a unique key fails to be in First Normal Form.
You must declare a PRIMARY KEY or UNIQUE constraint if you want other tables to reference this one with a foreign key.
In most RDBMS brands, both PRIMARY KEY and UNIQUE constraints implicitly create an index on the column(s). If it doesn't do this implicitly, you may be required to define the index yourself before you can declare the constraint.
Yes, you will need a primary key on the id field, since you do not want two labels that share the same id.
You also want an index, to speed up the search/lookup process in this table (although for small tables there is less performance gain). The sequence will just help you fill in the next ID; it does not prevent you from changing a previous value into one that already exists.
There are very little reasons for creating an index instead of a primary key. AS Bill Karwin said, you won't save resources at all. And, as you may have already guessed, there is no need at all to create a new index if you have the primary key.
In some cases it may be hard to find a key candidate. But it doesn't seems to be the case and it clearly goes against some good practices.
By the way. As your table is so small most queries will rather use a full table scan even if there is an index. Don't worry you see a full table scan.
From the developer's point of view, PRIMARY KEY is just a combination of NOT NULL and UNIQUE INDEX on one of the columns.
A UNIQUE INDEX is good if:
You need to enforce uniqueness. Using index is the most efficient way to do that.
You need to perform a SELECT, UPDATE or DELETE with a WHERE condition that is selective on the indexed field, that is number of rows affected by the query is much less that total number of the rows (say, 10 rows of 2,000,000).
A UNIQUE INDEX is bad if:
You don't need uniqueness on this field, of course :) But you'll better have one unique index in the table, for each record to be identifiable.
You need fast INSERTS
You need fast UPDATE's and DELETE's of the indexed value with a WHERE condition that is not selective on the indexed field, that is number of rows affected by the query is comparable to the total number of the rows (say, 1,500,000 rows of 2,000,000).
Given that you are going to have a little table, I'd advice you to create a PRIMARY KEY on it.
I am working on a database that usually uses GUIDs as primary keys.
By default SQL Server places a clustered index on primary key columns. I understand that this is a silly idea for GUID columns, and that non-clustered indexes are better.
What do you think - should I get rid of all the clustered indexes and replace them with non-clustered indexes?
Why wouldn't SQL's performance tuner offer this as a recommendation?
A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.
As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.
You almost certainly want to establish a clustered index on every table in your database.
If a table does not have a clustered index it is what is referred to as a "Heap" and performance of most types of common queries is less for a heap than for a clustered index table.
Which fields the clustered index should be established on depend on the table itself, and the expected usage patterns of queries against the table. In almost every case you probably want the clustered index to be on a column or a combination of columns that is unique, i.e., (an alternate key), because if it isn't, SQL will add a unique value to the end of whatever fields you select anyway. If your table has a column or columns in it that will be frequently used by queries to select or filter multiple records, (for example if your table contains sales transactions, and your application will frequently request sales transactions by product Id, or even better, a Invoice details table, where in almost every case you will be retrieving all the detail records for a specific invoice, or an invoice table where you often retrieve all the invoices for a particular customer... This is true whether you will be selected large numbers of records by a single value, or by a range of values)
These columns are candidates for the clustered index. The order of the columns in the clustered index is critical.. The first column defined in the index should be the column that will be selected or filtered on first in expected queries.
The reason for all this is based on understanding the internal structure of a database index. These indices are called balanced-tree (B-Tree) indices. they are kinda like a binary tree, except that each node in the tree can have an arbitrary number of entries, (and child nodes), instead of just two. What makes a clustered index different is that the leaf nodes in a clustered index are the actual physical disk data pages of the table itself. whereas the leaf nodes of the non-clustered index just "point" to the tables' data pages.
When a table has a clustered index, therefore, the tables data pages are the leaf level of that index, and each one has a pointer to the previous page and the next page in the index order (they form a doubly-linked-list).
So if your query requests a range of rows that is in the same order as the clustered index... the processor only has to traverse the index once (or maybe twice), to find the start page of the data, and then follow the linked list pointers to get to the next page and the next page, until it has read all the data pages it needs.
For a non-clustered index, it has to traverse the index once for every row it retrieves...
NOTE: EDIT
To address the sequential issue for Guid Key columns, be aware that SQL2k5 has NEWSEQUENTIALID() that does in fact generate Guids the "old" sequential way.
or you can investigate Jimmy Nielsens COMB guid algotithm that is implemented in client side code:
COMB Guids
The problem with clustered indexes in a GUID field are that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the records into the middle of the table.
However, with integer-based clustered indexes, the integers are normally sequential (like with an IDENTITY spec), so they just get added to the end an no data needs to be moved around.
On the other hand, clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. If you need to be able to SELECT records quickly, then use a clustered index... the INSERT speed will suffer, but the SELECT speed will be improved.
While clustering on a GUID is normally a bad idea, be aware that GUIDs can under some circumstances cause fragmentation even in non-clustered indexes.
Note that if you're using SQL Server 2005, the newsequentialid() function produces sequential GUIDs. This helps to prevent the fragmentation problem.
I suggest using a SQL query like the following to measure fragmentation before making any decisions (excuse the non-ANSI syntax):
SELECT OBJECT_NAME (ips.[object_id]) AS 'Object Name',
si.name AS 'Index Name',
ROUND (ips.avg_fragmentation_in_percent, 2) AS 'Fragmentation',
ips.page_count AS 'Pages',
ROUND (ips.avg_page_space_used_in_percent, 2) AS 'Page Density'
FROM sys.dm_db_index_physical_stats
(DB_ID ('MyDatabase'), NULL, NULL, NULL, 'DETAILED') ips
CROSS APPLY sys.indexes si
WHERE si.object_id = ips.object_id
AND si.index_id = ips.index_id
AND ips.index_level = 0;
If you are using NewId(), you could switch to NewSequentialId(). That should help the insert perf.
Yes, there's no point in having a clustered index on a random value.
You probably do want clustered indexes SOMEWHERE in your database. For example, if you have a "Author" table and a "Book" table with a foreign key to "Author", and if you have a query in your application that says, "select ... from Book where AuthorId = ..", then you would be reading a set of books. It will be faster if those book are physically next to each other on the disk, so that the disk head doesn't have to bounce around from sector to sector gathering all the books of that author.
So, you need to think about your application, the ways in which it queries the database.
Make the changes.
And then test, because you never know...
As most have mentioned, avoid using a random identifier in a clustered index-you will not gain the benefits of clustering. Actually, you will experience an increased delay. Getting rid of all of them is solid advice. Also keep in mind newsequentialid() can be extremely problematic in a multi-master replication scenario. If database A and B both invoke newsequentialid() prior to replication, you will have a conflict.
Yes you should remove the clustered index on GUID primary keys for the reasons Galwegian states above. We have done this on our applications.
It depends if you're doing a lot of inserts, or if you need very quick lookup by PK.