I have read that one of the tradeoffs for adding table indexes in SQL Server is the increased cost of insert/update/delete queries to benefit the performance of select queries.
I can conceptually understand what happens in the case of an insert because SQL Server has to write entries into each index matching the new rows, but update and delete are a little more murky to me because I can't quite wrap my head around what the database engine has to do.
Let's take DELETE as an example and assume I have the following schema:
CREATE TABLE Foo (
col1 int NOT NULL
,col2 int NOT NULL
,col3 int
,col4 int
,PRIMARY KEY (col1, col2)
);
CREATE INDEX IX_1 ON Foo (col3) INCLUDE (col4);
Now, if I issue the statement
DELETE FROM Foo WHERE col1=12 AND col2 > 34
I understand what the engine must do to update the table (or clustered index if you prefer). The index is set up to make it easy to find the range of rows to be removed and do so.
However, at this point it also needs to update IX_1 and the query that I gave it gives no obvious efficient way for the database engine to find the rows to update. Is it forced to do a full index scan at this point? Does the engine read the rows from the clustered index first and generate a smarter internal delete against the index?
It might help me to wrap my head around this if I understood better what is going on under the hood, but I guess my real question is this. I have a database that is spending a significant amount of time in delete and I'm trying to figure out what I can do about it.
When I display the execution plan for the deletion, it just shows an entry for "Clustered Index Delete" on table Foo which lists in the details section the other indices that need to be updated but I don't get any indication of the relative cost of these other indices.
Are they all equal in this case? Is there some way that I can estimate the impact of removing one or more of these indices without having to actually try it?
Nonclustered indexes also store the clustered keys.
It does not have to do a full scan, since:
your query will use the clustered index to locate rows
rows contain the other index value (c3)
using the other index value (c3) and the clustered index values (c1,c2), it can locate matching entries in the other index.
(Note: I had trouble interpreting the docs, but I would imagine that IX_1 in your case could be defined as if it was also sorted on c1,c2. Since these are already stored in the index, it would make perfect sense to use them to more efficiently locate records for e.g. updates and deletes.)
All this, however, has a cost. For each matching row:
it has to read the row, to find out the value for c3
it has to find the entry for (c3,c1,c2) in the nonclustered index
it has to delete the entry from there as well.
Furthermore, while the range query can be efficient on the clustered index in your case (linear access, after finding a match), maintenance of the other indexes will most likely result in random access to them for every matching row. Random access has a much higher cost than just enumerating B+ tree leaf nodes starting from a given match.
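The per-row maintenance described above can be sketched with a toy model. This is only an illustration, not how SQL Server is implemented: sorted Python lists stand in for the B+ trees, the sample rows are made up, and the assumption that IX_1 entries are ordered as (col3, col1, col2) follows the guess made earlier in this answer.

```python
import bisect

# Toy model only: sorted Python lists stand in for the B+ trees.
# Clustered index rows are (col1, col2, col3, col4), ordered by (col1, col2).
clustered = [
    (12, 30, 7, 100),
    (12, 35, 9, 101),
    (12, 40, 2, 102),
    (13, 1, 9, 103),
]
# IX_1 entries are (col3, col1, col2, col4): index key, clustering key, included column.
ix_1 = sorted((c3, c1, c2, c4) for (c1, c2, c3, c4) in clustered)


def delete_range(col1, col2_above):
    """Mimic DELETE FROM Foo WHERE col1 = <col1> AND col2 > <col2_above>."""
    # One seek into the clustered index, then a contiguous slice of rows.
    start = bisect.bisect_right(clustered, (col1, col2_above, float("inf")))
    end = bisect.bisect_left(clustered, (col1 + 1,))
    doomed = clustered[start:end]
    del clustered[start:end]

    # IX_1 is ordered by col3, so every deleted row needs its own seek.
    seeks = 0
    for c1, c2, c3, c4 in doomed:
        pos = bisect.bisect_left(ix_1, (c3, c1, c2, c4))
        del ix_1[pos]
        seeks += 1
    return doomed, seeks


deleted, seeks = delete_range(12, 34)
print(deleted, seeks)
```

The clustered index is touched in one contiguous slice, while the secondary index is visited once per deleted row at unrelated positions, which is the random-access cost mentioned above.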
Given the above query, more time is likely spent on non-clustered index maintenance than on the clustered index itself; the amount depends heavily on the number of records selected by the col1 = 12 AND col2 > 34 predicate.
My guess is that the cost is conceptually the same as if you did not have a secondary index but had e.g. a separate table, holding (c3,c1,c2) as the only columns in a clustered key and you did a DELETE for each matching row using (c3,c1,c2). Obviously, index maintenance is internal to SQL Server and is faster, but conceptually, I guess the above is close.
The above would mean that maintenance costs of indexes would stay pretty close to each other, since the number of entries in each secondary index is the same (the number of records) and deletion can proceed only one-by-one on each index.
If you need the indexes, then depending on the number of deleted records you might be better off scheduling the deletes: drop the indexes that the delete itself does not use, run the delete, and add them back afterwards. When many records are affected, rebuilding the indexes might be faster than maintaining them row by row.
I have the following fairly simple query that returns about 1 million rows (I've left out columns as they are just for output), but the query plan doesn't seem to want to use the index and wants me to create one:
SELECT [SAU]
,nr.[Headend]
,[Source]
,[Destination]
,[FibreHop]
,[CableRef]
,[CableSectionRef]
,[nNGAFibres]
,[nEthFibres]
,[FromID]
,[ToID]
,[FromIDTerm]
,[ToIDTerm],Reversed
,#Now
FROM [NodeRouting] nr
join [TargetHeadends] tex ON nr.Headend=tex.Headend
The index is:
CREATE NONCLUSTERED INDEX [NodeRouting_Headend] ON [NodeRouting]
(
[Headend] ASC
)
In the other table, Headend is the primary key.
The query plan is this:
If I give it a hint to use the index already created (non-unique, non-clustered) on the id field:
join [TargetHeadends] tex ON nr.id=tex.id (index=NodeRouting_Headend)
It changes to this:
Incidentally, the first plan's estimated row count of 966,000 matches reality. The RID lookup's estimate of 761,000 is a few hundred thousand short, and its operator cost seems a lot higher.
One thing that strikes me as a little odd is that in the first example, where it chose not to use the index, it says this:
Missing Index (impact 99): CREATE NONCLUSTERED INDEX <NAME> ON NodeRouting(id) include (....)
CREATE NONCLUSTERED INDEX [<Name>]
ON [NodeRouting] ([Headend])
INCLUDE ([SAU],[Source],[Destination],[FibreHop],[CableRef],[CableSectionRef],[nNGAFibres],[nEthFibres],[FromID],[ToID],[FromIDTerm],[ToIDTerm],[Reversed])
I appreciate I'm returning more columns than are in the index, but I would have thought the index would still have been used without the INCLUDE?
Indexes don't always help, and they should not need to be forced into use. For example, for small tables a scan will be used because the index overhead makes a seek more work than simply reading the table. Don't force the use of the index.
For a large table, an index helps when it is "selective" and the query is selective. It will get a few records quickly. It does not get a lot of records quickly. As a rough rule of thumb, if the query selects fewer than about 5% of the rows, the index might be used. If not, a scan might be faster than using the non-selective index.
If you are returning all the records, then there is no selectivity. A scan is going to be more efficient. For the join, other methods are more efficient than the lookup for a lot of records.
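A deliberately crude cost model shows why there is a tipping point. The function names and the numbers are hypothetical, and the model assumes one page read per looked-up row and one read per table page for a scan; the real optimizer's costing is far more detailed.

```python
import math


def estimated_page_reads(total_rows, rows_per_page, matching_rows):
    """Return (scan_reads, seek_lookup_reads) under a crude model."""
    # A scan reads every page of the table exactly once.
    scan_reads = math.ceil(total_rows / rows_per_page)
    # Assumption: each matching row costs one separate page read for its lookup.
    seek_lookup_reads = matching_rows
    return scan_reads, seek_lookup_reads


def cheaper_path(total_rows, rows_per_page, matching_rows):
    scan_reads, seek_lookup_reads = estimated_page_reads(
        total_rows, rows_per_page, matching_rows
    )
    return "scan" if scan_reads <= seek_lookup_reads else "seek + lookup"


# Returning every row of a 1,000,000-row table versus returning only 50 rows.
print(cheaper_path(1_000_000, 100, 1_000_000))
print(cheaper_path(1_000_000, 100, 50))
```

With 100 rows per page the table is 10,000 pages, so in this model the lookups stop paying off once more than about 10,000 rows match.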
Using a phonebook analogy, just start at the front of the phone book and read it to the end. Don't start at the start of the index and lookup each name one at a time until you get to the end of the index.
A covering index can help because it can be scanned in place of scanning the original table (clustered index). For example, if you have two phone books where one has address information and the other does not, then reading the one without address information will be faster if you are not interested in addresses.
FYI: Don't trust the order of the columns for the index suggestions. Also, the index suggested in this case might be a covering index to avoid reading unneeded columns - not for selectivity.
What is the index creating strategy?
Is it possible to create more than one non-clustered index on the same column in SQL Server?
How about creating clustered and non-clustered on same column?
Very sorry, but indexing is very confusing to me.
Is there any way to find out the estimated query execution time in SQL Server?
The words are rather logical and you'll learn them quite quickly. :)
In layman's terms, SEEK implies seeking out precise locations for records, which is what SQL Server does when the column you're searching in is indexed, and your filter (the WHERE condition) is accurate enough.
SCAN means a larger range of rows where the query execution planner estimates it's faster to fetch a whole range as opposed to individually seeking each value.
And yes, you can have multiple indexes on the same field, and sometimes it can be a very good idea. Play around with the indexes and use the query execution planner to determine what happens (shortcut in SSMS: Ctrl + M). You can even run two versions of the same query and the execution planner will easily show you how many resources and how much time each one takes, making optimization quite easy.
But to expand on these a bit, say you have an address table like so, and it has over 1 billion records:
CREATE TABLE ADDRESS
( ADDRESS_ID INT -- CLUSTERED primary key ADDRESS_PK_IDX
, PERSON_ID INT -- FOREIGN KEY, NONCLUSTERED INDEX ADDRESS_PERSON_IDX
, CITY VARCHAR(256)
, MARKED_FOR_CHECKUP BIT
-- plus n^10 different other columns...
)
Now, if you want to find all the address information for person 12345, the index on PERSON_ID is perfect. Since the table has loads of other data on the same row, it would be inefficient and space-consuming to create a nonclustered index to cover all other columns as well as PERSON_ID. In this case, SQL Server will execute an index SEEK on the index on PERSON_ID, then use that to do a Key Lookup on the clustered index on ADDRESS_ID, and from there return all the data in all other columns on that same row.
However, say you want to search for all the persons in a city, but you don't need other address information. This time, the most effective way would be to create an index on CITY and use INCLUDE option to cover PERSON_ID as well. That way, a single index seek / scan would return all the information you need without the need to resort to checking the CLUSTERED index for the PERSON_ID data on the same row.
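The difference between the two access paths so far can be sketched with a toy model. Everything here is hypothetical: plain dictionaries stand in for the indexes, the three sample rows are invented, and the function names are only for illustration.

```python
# Clustered index: ADDRESS_ID -> full row (made-up sample data).
clustered = {
    1: {"PERSON_ID": 12345, "CITY": "New York", "MARKED_FOR_CHECKUP": 1},
    2: {"PERSON_ID": 12345, "CITY": "Boston", "MARKED_FOR_CHECKUP": 0},
    3: {"PERSON_ID": 777, "CITY": "New York", "MARKED_FOR_CHECKUP": 0},
}

# Nonclustered index on PERSON_ID: key -> clustering keys only.
person_idx = {}
# Nonclustered index on CITY with PERSON_ID included: key -> (clustering key, PERSON_ID).
city_idx = {}
for address_id, row in clustered.items():
    person_idx.setdefault(row["PERSON_ID"], []).append(address_id)
    city_idx.setdefault(row["CITY"], []).append((address_id, row["PERSON_ID"]))


def addresses_for_person(person_id):
    """Seek on PERSON_ID, then one key lookup per match to fetch the full row."""
    address_ids = person_idx.get(person_id, [])
    rows = [clustered[address_id] for address_id in address_ids]
    return rows, len(address_ids)  # second value: number of key lookups


def persons_in_city(city):
    """Seek on CITY; PERSON_ID is already in the index, so no key lookups."""
    entries = city_idx.get(city, [])
    return [person_id for (_address_id, person_id) in entries], 0


rows, lookups = addresses_for_person(12345)
people, city_lookups = persons_in_city("New York")
print(lookups, people, city_lookups)
```

The first query has to go back to the clustered index once per matching row, while the second is answered entirely from the index entries.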
Now, let's say both of those queries are required but still rather heavy because of the 1 billion records. But there's one special query that needs to be really really fast. That query wants all the persons on addresses that have been MARKED_FOR_CHECKUP, and who must live in New York (ignore whatever checkup means, that doesn't matter). Now you might want to create a third, filtered index on MARKED_FOR_CHECKUP and CITY, with INCLUDE covering PERSON_ID, and with a filter saying CITY = 'New York' and MARKED_FOR_CHECKUP = 1. This index would be insanely fast, as it only ever covers queries that satisfy those exact conditions, and therefore has a fraction of the data to go through compared to the other indexes.
(Disclaimer here, bear in mind that the query execution planner is not stupid, it can use multiple nonclustered indexes together to produce the correct results, so the examples above may not be the best ones available as it's very hard to imagine when you would need 3 different indexes covering the same column, but I'm sure you get the idea.)
The types of index, their columns, included columns, sort orders, filters etc. depend entirely on the situation. You will need covering indexes to satisfy several different types of queries, as well as customized indexes created specifically for single, important queries. However, each index takes up disk space, requires extra maintenance whenever the data model changes, and costs time in defragmentation and statistics update operations, so you don't want to just slap an index on everything either.
Experiment, learn and work out which works best for your needs.
I'm not the expert on indexing either, but here is what I know.
You can have only ONE Clustered Index per table.
You can have up to a certain limit of non clustered indexes per table. Refer to http://social.msdn.microsoft.com/Forums/en-US/63ba3877-e0bd-4417-a04b-19c3bfb02ac9/maximum-number-of-index-per-table-max-no-of-columns-in-noncluster-index-in-sql-server?forum=transactsql
Indexes just need different names, but it's better not to use the same column(s) in a lot of different indexes, as you will run into performance problems.
A very important point to remember is that although indexes make your selects faster, they slow down your inserts, updates and deletes, because the information also needs to be written to each index. The more indexes you have on a column that gets updated a lot, the more drastically the speed of the update drops.
You can include columns that are used in a CLUSTERED index in one or more NON-CLUSTERED indexes.
Here is some more reading material
http://www.sqlteam.com/article/sql-server-indexes-the-basics
http://www.programmerinterview.com/index.php/database-sql/what-is-an-index/
EDIT
Another point to remember is that an index takes up space just like the table. The more indexes you create, the more space they use, so try not to use char/varchar (or nchar/nvarchar) in an index. It uses too much space in the index, and on huge columns gives basically no benefit. When your indexes start to become bigger than your table, it also means that you have to re-examine your index strategy.
Imagine Foo table has non-clustered indexes on ColA and ColB
and NO Indexes on ColC, ColD
SELECT colA, colB
FROM Foo
takes about 30 seconds.
SELECT colA, colB, colC, colD
FROM Foo
takes about 2 minutes.
Foo table has more than 5 million rows.
Question:
Is it possible that including columns that are not part of the indexes can slow down the query?
If yes, why? Aren't they part of the pages that have already been read?
If you write a query that uses a covering index, then the full data pages in the heap/clustered index are not accessed.
If you subsequently add more columns to the query, such that the index is no longer covering, then either additional lookups will occur (if the index is still used), or you force a different data access path entirely (such as using a table scan instead of using an index)
Since 2005, SQL Server has supported the concept of Included Columns in an index. This includes non-key columns in the leaf of an index - so they're of no use during the data-lookup phase of index usage, but still help to avoid performing an additional lookup back in the heap/clustered index, if they're sufficient to make the index a covering index.
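Whether an index covers a query comes down to a simple set check, sketched below. The helper name is hypothetical, and the example assumes a single non-clustered index keyed on (colA, colB), which the question does not actually spell out.

```python
def is_covering(query_columns, key_columns, included_columns=(), clustering_key=()):
    """True if every column the query touches is available in the index itself."""
    # Index keys, included columns and the clustering key are all stored in the index.
    available = set(key_columns) | set(included_columns) | set(clustering_key)
    return set(query_columns) <= available


# Assumed index: keyed on (colA, colB), nothing included.
print(is_covering(["colA", "colB"], ["colA", "colB"]))
print(is_covering(["colA", "colB", "colC", "colD"], ["colA", "colB"]))
# Same key, but with colC and colD added as included columns.
print(is_covering(["colA", "colB", "colC", "colD"], ["colA", "colB"], ["colC", "colD"]))
```

Only the second query falls outside what the index stores, so it is the one that needs lookups or a different access path.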
Also, in future, if you want to get a better understanding on why one query is fast and another is slow, look into generating Execution Plans, which you can then compare.
Even if you don't understand the terms used, you should at least be able to play "spot the difference" between them and then search on the terms (such as table scan, index seek, or lookup)
The simple answer is: a non-clustered index is not stored in the same pages as the data, so SQL Server has to look up the actual data pages to pick up the rest.
Non-clustered indexes are stored in separate data structures, while a clustered index is stored in the same place as the actual data. That's why you can have only one clustered index.
We have a view that is used to lookup a record in a table by clustered index. The view also has a couple of subqueries in the select statement that lookup data in two large tables, also by clustered index.
To hugely simplify it would be something like this:
SELECT a,
(SELECT b FROM tableB where tableB.a=tableA.a) as b,
(SELECT c FROM tableC where tableC.a=tableA.a) as c
FROM tableA
Most lookups to [tableB] correctly use a non-clustered index on [tableB] and work very efficiently. However, very occasionally SQL Server, in generating an execution plan, has instead used an index on [tableB] that doesn't contain the value being passed through. So, following the example above, although an index of column [a] exists on tableB, the plan instead does a scan of a clustered index that has column [z]. Using SQL's own language, the plan's "predicate is not relevant to the object". I can't see why this would ever be practical. As a result, when SQL does this, it has to scan every record in the index, because the value would never be found there, taking up to 30 seconds. It just seems plain wrong, always.
Has any one seen this before, where an execution plan does something that looks like it could never be right? I am going to rewrite the query anyway, so my concern is less about the structure of the query, but more as to why SQL would ever get it that wrong.
I know sometimes SQL Server can choose a plan that worked once and it can become inefficient as the dataset changes but in this case it could never work.
Further information
[tableB] has 4 million records, and most values for [a] are null
I'm unable now to get hold of the initial query that generated the plan
These queries are run through Coldfusion but at this time I'm interested in anyone having seen this independently in SQL Server
It just seems plain wrong, always.
You might be interested in the First Rule of Programming.
So, following the example above, although an index of column [a] exists on tableB, the plan instead does a scan of a clustered index that has column [z].
A clustered index always includes all rows. It might be ordered by z, but it will still contain all other columns at the leaf level.
The reason SQL Server sometimes prefers a clustered scan over an index seek is this. When you do an index seek, you have to follow it up with a bookmark lookup to the clustered index to retrieve columns that are not in the index.
When you do a clustered index scan, you by definition find all columns. That means no bookmark lookup is required.
When SQL Server expects many rows, it tries to avoid the bookmark lookups. This is a time-tested choice. Nonclustered index seeks are routinely beaten by clustered index scans.
You can test this for your case by forcing either access path with the WITH (INDEX(IX_YourIndex)) table hint.
If I'm trying to squeeze every last drop of performance out of a query, what effect does it have when my joins use each of these types of index?
clustered index.
non-clustered index.
clustered or non-clustered index with extra columns that may not be involved in the join.
Will I gain any performance if I go through and create clustered indexes that only contain the columns involved in my joins and nothing else?
(I realize I may have to move the clustered index from another index(making that index non-clustered) since it can only have one.)
In addition to Gareth Saul's answer a tiny clarification:
Non-clustered indexes repeat the included fields, with a pointer to the rows that have that value.
This pointer to the actual data value is the column (or the set of columns) that are in your clustering key.
That's one of the main reasons why you should try and keep the clustering key small and static - small because otherwise you'll waste a lot of space, on disk and in your server's RAM, and static because otherwise, you'll have to update not just your clustering index, but also all your non-clustered indices as well, if your value changes.
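A back-of-the-envelope calculation shows how quickly a wide clustering key adds up. The helper name, row count and index count are made up, and the estimate ignores per-row overhead, non-leaf levels and compression; it only counts the clustering key bytes repeated in every non-clustered index entry.

```python
def clustering_key_overhead_bytes(row_count, nonclustered_index_count, clustering_key_bytes):
    """Bytes spent storing the clustering key inside the non-clustered indexes."""
    # Each non-clustered index holds one entry per row, and each entry
    # carries a copy of the clustering key as its lookup value.
    return row_count * nonclustered_index_count * clustering_key_bytes


ROWS = 10_000_000   # hypothetical table size
INDEXES = 5         # hypothetical number of non-clustered indexes

int_key = clustering_key_overhead_bytes(ROWS, INDEXES, 4)     # 4-byte INT key
guid_key = clustering_key_overhead_bytes(ROWS, INDEXES, 16)   # 16-byte UNIQUEIDENTIFIER key
extra = guid_key - int_key
print(int_key, guid_key, extra)
```

Under these assumptions the wider key costs an extra 600,000,000 bytes across the five indexes, before counting the table itself.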
This "lookup pointer is the clustering key" feature has been in SQL Server since version 7, as Kim Tripp will explain in great detail here:
What is a clustered index?
In SQL Server 7.0 and higher the internal dependencies on the clustering key CHANGED. (Yes, it's important to know that things CHANGED in 7.0... why? Because there are still some folks out there that don't realize how RADICAL of a change occurred in the internals (wrt to the clustering key) in SQL Server 7.0). What changed is that the clustering key gets used as the "lookup" value from the nonclustered indexes.
Will I gain any performance if I go through and create clustered indexes that only contain the columns involved in my joins and nothing else?
Not as I understand it. The point of a clustered index is that it sorts the data on disk around that index (which is why you can only have the one), so if your join data isn't being sorted by those exact columns as well, I don't think it'd make any difference. Plus, by putting data that might change (as opposed to the key) into the clustered index, you make it more likely that things will need rebuilding periodically, slowing the overall database down.
Sorry if this sounds a daft question, but have you tried running your query through the index tuning wizard? Not foolproof by any stretch but I've had some decent improvements from it in the past.
You only get one clustered index - this is what controls the physical storage of the table on disk / in memory.
Non-clustered indexes repeat the included fields, with a pointer to the rows that have that value. Having an index on the columns being used in your joins should improve performance. You can further optimise by using "included columns" in your index - this duplicates the row information directly into the index, which can remove the performance penalty of having to look up the row itself to perform the select.
It is useful to pay attention to the order in which your joins occur - the sequence of columns in your index should match up to this. Remember that the SQL engine may optimise and re-order your query internally - profiling may be helpful.
In most situations, you can just use the Database Engine Tuning Advisor - the recommendations it provides are pretty much spot on.
If you can, your best bet is a non-clustered index that contains all the elements of your join and, if possible, the fields you are selecting.
This creates a covering index, meaning that all the fields SQL needs to execute the query are in one index.
If possible, have an index with no unnecessary fields in it. Every field added makes each individual index record larger, and the smaller each index record, the more of them fit in each page. The more index items you get in each page, the less often you have to go to disk.
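The effect of record size on page count is simple arithmetic. The sketch below assumes roughly 8,096 bytes per 8 KB page are available for rows and ignores per-row overhead and fill factor; the entry sizes and row count are made up.

```python
import math

# Assumption: an 8 KB page minus a 96-byte header leaves about 8,096 bytes for rows.
PAGE_BYTES_FOR_ROWS = 8096


def pages_needed(row_count, entry_bytes):
    """Leaf pages needed to hold row_count index entries of entry_bytes each."""
    entries_per_page = PAGE_BYTES_FOR_ROWS // entry_bytes
    return math.ceil(row_count / entries_per_page)


narrow = pages_needed(1_000_000, 20)    # lean index entry
wide = pages_needed(1_000_000, 200)     # entry padded with unneeded fields
print(narrow, wide)
```

In this model the wide index needs about ten times as many pages, so a scan of it reads about ten times as much.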
Clustered Index - This means the table is laid out in the order specified in the index, so you will get better performance for select * from TABLE where INDEXFIELD = 3. Unless you are selecting lots of large data items, this should not be required.