In this blog post, I need clarification why SQL server would choose a particular type of scan:
Let’s assume for simplicities sake
that col1 is unique and is ever
increasing in value, col2 has 1000
distinct values and there are
10,000,000 rows in the table, and that
the clustered index consists of col1,
and a nonclustered index exists on
col2.
Imagine the query execution plan
created for the following initially
passed parameters: #P1= 1 #P2=99
These values would result in an
optimal queryplan for the following
statement using the substituted
parameters:
Select * from t where col1 > 1 or col2
99 order by col1;
Now, imagine the query execution plan
if the initial parameter values were:
#P1 = 6,000,000 and #P2 = 550.
As before, an optimal queryplan would
be created after substituting the
passed parameters:
Select * from t where col1 > 6000000
or col2 > 550 order by col1;
These two identical parameterized SQL
Statements would potentially create
and cache very different execution
plans due to the difference of the
initially passed parameter values.
However, since SQL Server only caches
one execution plan per query, chances
are very high that in the first case
the query execution plan will utilize
a clustered index scan because of the
‘col1 > 1’ parameter substitution.
Whereas, in the second case a query
execution plan using index seek would
most likely be created.
from: http://blogs.msdn.com/sqlprogrammability/archive/2008/11/26/optimize-for-unknown-a-little-known-sql-server-2008-feature.aspx
Why would the first query use a clustered index, and a index seek in the second query?
Assuming that the columns contain only positive integers:
SQL Server would look at the statistics for the table and see that, for the first query, all rows in the table meet the criteria of col1>1, so it chooses to scan the clustered index.
For the second query, a relatively small proportion of rows would meet the criteria of col1> 6000000, so using an index seek would improve performance.
Notice that in both cases the clustered index will be used. In the first example it is a clustered index SCAN where as in the second example it will be a clustered index SEEK which in most cases will be the faster as the author of the blog states.
SQL Server knows that the clustered index is increasing. Therefore it will do a clustered index scan in the first case.
In cases where the optimizer sees that the majority of the table will be returned in the query, such as the first query, then it's more efficient to perform a scan then a seek.
Where only a small portion of the table will be returned, such as in the second query, then an index seek is more efficient.
A scan will touch every row in the table whether it qualifies or not. The cost is proportional to the total number of rows in the table. A scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.
A seek will touch rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.
Related
In my SQL Server database I have a table of Requests with requestID (int) as Identity, PK and Clustered index. There are approximately 30 other columns in the table.
I am using Entity Framework to access the DB.
There is a function called GetRequestByID(int requestID) that pulls all the columns from the Requests table and columns from related tables using inner joins.
Recently, to reduce the amount of data pulled where not needed, I created two additional functions, GetRequestByID_Lite and GetRequestByID_EvenLiter that return lesser number of columns, and replaced all the relevant calls in the code.
For each of those functions I created a corresponding non-clustered index by requestID and including only the columns each function needs.
After one hour, first thing I see is that the memory consumed by the process decreased dramatically.
When I ran SYS.DM_DB_INDEX_USAGE_STATS, I see the following for the new indexes:
_index_for_GetRequestByID_Lite - 0 seeks, 422 scans, 0 lookups, 49 updates
_index_for_GetRequestByID_EvenLiter - 0 seeks, 0 scans, 0 lookups, 51 updates
My question is why so many scans and no seeks for _index_for_GetRequestByID_Lite?
If the index doesn't contain all the columns required, then why doesn't SQL Server just use the clustered index?
And why _index_for_GetRequestByID_EvenLiter is not being used at all (there is no doubt the function GetRequestByID_EvenLiter is called a lot)?
Also, when I run an SQL query equivalent to GetRequestByID_EvenLiter, the Clustered index is used in execution plan instead of _index_for_GetRequestByID_EvenLiter.
Thank You.
SQLServer might not have found your index effective in terms of cost.
see below example
create table
test
(
col1 int primary key,
col2 int,
col3 int,
col4 varchar(10),
col5 datetime
)
insert into test
select number,number+1,number+2,number+5,dateadd(day,number,getdate())
from numbers
Let's create an index
create index nc_Col2 on test(col2)
include(Col3,col4)
Now if we run a query like below
select * from test
where col2>4
and see execution plan cost...
You might have thought sqlserver should have used above index,but it didn't.Now let's observe the cost when we force sqlserver to use that index
select * from test with (index (nc_col2))
where col2>4
In summary ,the reason being your index might not be used may be due to
It is not cost effective compared to other existing possibilties
your index is not efficient as shown in my example( i am selecting * and index has only three columns)
also there are some more concepts like allocation scan,sequential scan,but in summary SQL has to believe your index costs less.Check out below links to see how to improve costing
Further reading:
Inside the Optimizer: Plan Costing
https://dba.stackexchange.com/a/23716/31995
After running the following query:
SELECT [hour], count(*) as hits, avg(elapsed)
FROM myTable
WHERE [url] IS NOT NULL and floordate >= '2017-05-01'
group by [hour]
the execution plan is basically a clustered Index Scan on the PK (int, auto-increment, 97% of the work)
The thing is: URL has a index on it (regular index because i'm always searching for a exact match), floordate also has an index...
Why are they not being used? How can i speed up this query?
PS: table is 70M items long and this query takes about 9 min to run
Edit 1
If i don't use (select or filter) a column on my index, will it still be used? Usually i also filter-for/group-by clientId (approx 300 unique across the db) and hour (24 unique)...
In this scenario, two things affect how SQL Server will choose an index.
How selective is the index. A higher selectivity is better. NULL/NOT NULL filters generally have a very low selectivity.
Are all of the columns in the index, also known as a covering index.
In your example, if the index cannot cover the query, SQL will have to look up the other column values against the base table. If your URL/Floordate combination is not selective enough, SQL may determine it is cheaper to scan the base table rather than do an expensive lookup from the non-clustered index to the base table for a large number of rows.
Without knowing anything else about your schema, I'd recommend an index with the following columns:
floordate, url, hour; include elapsed
Date ranges scans are generally more selective than a NULL/NOT NULL test. Moving Floordate to the front may make this index more desirable for this query. If SQL determines the query is good for Floordate and URL, the Hour column can be used for the Group By action. Since Elapsed is included, this index can cover the query completely.
You can include ClientID after hour to see if that helps your other query as well.
As long as an index contains all of the columns to resolve the query, it is a candidate for use, even if there is no filtering needed. Generally speaking, a non-clustered index is skinnier than the base table, requiring less IO than scanning the full width base table.
I have a table MY_TABLE with approximately 9 million rows.
There are in total of 38 columns in this table. The columns that are relevant to my question are:
RECORD_ID: identity, bigint, with unique clustered index
RECORD_CREATED: datetime, with non-unique & non-clustered index
Now I run the following two queries and naturally expect the first one to execute faster because the data is being sorted by a column that has a unique clustered index but somehow it executes 271 times(!) slower.
SELECT TOP 1
RECORD_ID
FROM
MY_TABLE
WHERE
RECORD_CREATED >= '20140801'
ORDER BY
RECORD_ID
SELECT TOP 1
RECORD_ID
FROM
MY_TABLE
WHERE
RECORD_CREATED >= '20140801'
ORDER BY
RECORD_CREATED
The execution times are 1630 ms and 6 ms, respectively.
Please advise.
P.S.: Due to security policies of the environment I cannot see the execution plan or use SQL Profiler.
SQL Server has a few choices to make about how to perform this query. It could begin by sorting all the items, leveraging the indexes you mentioned, and then follow that up by filtering out any items that don't match the WHERE clause. However, it's typically faster to cut down on the size of the data set that you're working with first, so you don't have to sort as many items.
So SQL Server is most-likely choosing to perform the WHERE filter first. When it does this, it most likely starts by using the non-unique, non-clustered index on RECORD_CREATED to skip over all the items where RECORD_CREATED is less than '20140801', and then take all the items after that.
At that point, all the items are pre-sorted in the order in which they were found in the RECORD_CREATED index, so the second query requires no additional effort, but the first query then has to perform a sort on the records that have been chosen.
I have a table with 3 columns: ID, FirstName, Salary.
Now I have a clustered index on ID and when I execute this query
select * from table where FirstName = 'a'
I get the result using a clustered indexed scan and asks me to add a non clustered index on the first name.
When I add a non clustered index on the FirstName then I get the result as a result of a Index Seek.
I don't understand why an index seek? Since the non clustered index does not sort the data shouldn't it be an scan?
If the number of rows matched by your criteria (FirstName = 'a') is sufficiently small, then the SQL Server query optimizer will use an index seek in your non-clustered index to find the row(s) that match that criteria, and then in a second step use a Key lookup into the main data pages to get all the data columns (since you're using a SELECT * ..).
It's up to the optimizer to decide when a index seek + key lookup is more efficient than a clustered index scan. It depends mostly on selectivity - if your criteria matches too many rows, a clustered index scan will occur instead.
The "magic" that enables the query optimizer to make such decisions is the statistics about the data and data distribution that SQL Server keeps - both for indexes, as well as for additional, selective columns in your tables.
And if the statistics are out of date (since you've done a lot of updating and possibly deleting), then it's entirely possible that the query optimizer will use inefficient / unsuitable execution plans. This is one of the major performance issues - always make sure that your statistics are well maintained and up to date! (by using e.g. a maintenance plan that updates the statistics every night or every week - depending on how frequently you manipulate large chunks of data).
Imagine Foo table has non-clustered indexes on ColA and ColB
and NO Indexes on ColC, ColD
SELECT colA, colB
FROM Foo
takes about 30 seconds.
SELECT colA, colB, colC, colD
FROM Foo
takes about 2 minutes.
Foo table has more than 5 million rows.
Question:
Is it possible that including columns that are not part of the indexes can slow down the query?
If yes, WHY? -Are not they part of the already read PAGEs?
If you write a query that uses a covering index, then the full data pages in the heap/clustered index are not accessed.
If you subsequently add more columns to the query, such that the index is no longer covering, then either additional lookups will occur (if the index is still used), or you force a different data access path entirely (such as using a table scan instead of using an index)
Since 2005, SQL Server has supported the concept of Included Columns in an index. This includes non-key columns in the leaf of an index - so they're of no use during the data-lookup phase of index usage, but still help to avoid performing an additional lookup back in the heap/clustered index, if they're sufficient to make the index a covering index.
Also, in future, if you want to get a better understanding on why one query is fast and another is slow, look into generating Execution Plans, which you can then compare.
Even if you don't understand the terms used, you should at least be able to play "spot the difference" between them and then search on the terms (such as table scan, index seek, or lookup)
Simple answer is: because non-clustered index is not stored in the same page as data so SQL Server has to lookup actual data pages to pick up the rest.
Non-clustered index are stored in separate data structures while clustered indexes are stored in the same place as the actual data. That’s why you can have only one clustered index.