I have a table MY_TABLE with approximately 9 million rows.
There are 38 columns in total in this table. The columns that are relevant to my question are:
RECORD_ID: identity, bigint, with unique clustered index
RECORD_CREATED: datetime, with non-unique & non-clustered index
Now I run the following two queries and naturally expect the first one to execute faster because the data is being sorted by a column that has a unique clustered index but somehow it executes 271 times(!) slower.
SELECT TOP 1
RECORD_ID
FROM
MY_TABLE
WHERE
RECORD_CREATED >= '20140801'
ORDER BY
RECORD_ID
SELECT TOP 1
RECORD_ID
FROM
MY_TABLE
WHERE
RECORD_CREATED >= '20140801'
ORDER BY
RECORD_CREATED
The execution times are 1630 ms and 6 ms, respectively.
Please advise.
P.S.: Due to security policies of the environment I cannot see the execution plan or use SQL Profiler.
SQL Server has a few choices to make about how to perform this query. It could begin by sorting all the items, leveraging the indexes you mentioned, and then follow that up by filtering out any items that don't match the WHERE clause. However, it's typically faster to cut down on the size of the data set that you're working with first, so you don't have to sort as many items.
So SQL Server is most likely choosing to perform the WHERE filter first. When it does this, it most likely starts by using the non-unique, non-clustered index on RECORD_CREATED to skip over all the items where RECORD_CREATED is less than '20140801', and then takes all the items after that.
At that point, all the items are pre-sorted in the order in which they were found in the RECORD_CREATED index, so the second query requires no additional effort, but the first query then has to perform a sort on the records that have been chosen.
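Since the execution plan is not visible, one way to test this explanation is to compare the queries with timing and I/O statistics switched on, optionally forcing an index with a hint. A rough sketch, assuming the session is allowed to use SET STATISTICS and that the RECORD_CREATED index is named IX_MY_TABLE_RECORD_CREATED (a made-up name; substitute the real one):

```sql
-- Report elapsed time and logical reads in the Messages output.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Original slow query, left to the optimizer.
SELECT TOP 1 RECORD_ID
FROM MY_TABLE
WHERE RECORD_CREATED >= '20140801'
ORDER BY RECORD_ID;

-- Same query, forced onto the RECORD_CREATED index
-- (the index name is hypothetical).
SELECT TOP 1 RECORD_ID
FROM MY_TABLE WITH (INDEX (IX_MY_TABLE_RECORD_CREATED))
WHERE RECORD_CREATED >= '20140801'
ORDER BY RECORD_ID;

-- Same query, forced onto the clustered index (index id 1).
SELECT TOP 1 RECORD_ID
FROM MY_TABLE WITH (INDEX (1))
WHERE RECORD_CREATED >= '20140801'
ORDER BY RECORD_ID;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Comparing the logical reads of the three runs should indicate which access path the unhinted query resembles.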
Related
In my SQL Server database I have a table of Requests with requestID (int) as Identity, PK and Clustered index. There are approximately 30 other columns in the table.
I am using Entity Framework to access the DB.
There is a function called GetRequestByID(int requestID) that pulls all the columns from the Requests table and columns from related tables using inner joins.
Recently, to reduce the amount of data pulled where not needed, I created two additional functions, GetRequestByID_Lite and GetRequestByID_EvenLiter, that return fewer columns, and replaced all the relevant calls in the code.
For each of those functions I created a corresponding non-clustered index by requestID and including only the columns each function needs.
After one hour, the first thing I see is that the memory consumed by the process has decreased dramatically.
When I query SYS.DM_DB_INDEX_USAGE_STATS, I see the following for the new indexes:
_index_for_GetRequestByID_Lite - 0 seeks, 422 scans, 0 lookups, 49 updates
_index_for_GetRequestByID_EvenLiter - 0 seeks, 0 scans, 0 lookups, 51 updates
My question is: why are there so many scans and no seeks for _index_for_GetRequestByID_Lite?
If the index doesn't contain all the columns required, then why doesn't SQL Server just use the clustered index?
And why is _index_for_GetRequestByID_EvenLiter not being used at all (there is no doubt the function GetRequestByID_EvenLiter is called a lot)?
Also, when I run an SQL query equivalent to GetRequestByID_EvenLiter, the clustered index is used in the execution plan instead of _index_for_GetRequestByID_EvenLiter.
Thank You.
SQL Server might not have found your index effective in terms of cost.
See the example below:
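-- Sample table; the primary key creates a clustered index on col1.
create table test
(
col1 int primary key,
col2 int,
col3 int,
col4 varchar(10),
col5 datetime
)
-- Populate from a helper table of sequential integers (numbers is assumed to exist).
insert into test
select number, number+1, number+2, number+5, dateadd(day, number, getdate())
from numbers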
Let's create an index
create index nc_Col2 on test(col2)
include(Col3,col4)
Now if we run a query like below
select * from test
where col2>4
and see execution plan cost...
You might have thought SQL Server should have used the above index, but it didn't. Now let's observe the cost when we force SQL Server to use that index:
select * from test with (index (nc_col2))
where col2>4
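For contrast, a query that asks only for columns stored in nc_Col2 can be answered from that index alone, so the optimizer has much less reason to avoid it. A small sketch against the same test table (whether the index is actually chosen still depends on the optimizer's cost estimate):

```sql
-- col2 is the index key; col3 and col4 are included columns,
-- so no lookup into the clustered index is needed for this query.
select col2, col3, col4
from test
where col2 > 4;

-- select * also needs col5, which is not in nc_Col2, so using the
-- index would require a key lookup for each qualifying row.
select *
from test
where col2 > 4;
```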
In summary, your index might not be used because:
It is not cost effective compared to other existing possibilities.
Your index is not efficient, as shown in my example (I am selecting * and the index has only three columns).
There are also other concepts, such as allocation scans and sequential scans, but in summary SQL Server has to believe your index costs less. Check out the links below to see how to improve costing.
Further reading:
Inside the Optimizer: Plan Costing
https://dba.stackexchange.com/a/23716/31995
After running the following query:
SELECT [hour], count(*) as hits, avg(elapsed)
FROM myTable
WHERE [url] IS NOT NULL and floordate >= '2017-05-01'
group by [hour]
the execution plan is basically a clustered Index Scan on the PK (int, auto-increment, 97% of the work)
The thing is: URL has an index on it (a regular index, because I'm always searching for an exact match), and floordate also has an index...
Why are they not being used? How can I speed up this query?
PS: the table has 70M rows and this query takes about 9 min to run
Edit 1
If I don't use (select or filter) a column in my index, will it still be used? Usually I also filter for/group by clientId (approx. 300 unique across the db) and hour (24 unique)...
In this scenario, two things affect how SQL Server will choose an index.
How selective the index is. Higher selectivity is better. NULL/NOT NULL filters generally have very low selectivity.
Whether all of the columns the query needs are in the index, also known as a covering index.
In your example, if the index cannot cover the query, SQL will have to look up the other column values against the base table. If your URL/Floordate combination is not selective enough, SQL may determine it is cheaper to scan the base table rather than do an expensive lookup from the non-clustered index to the base table for a large number of rows.
Without knowing anything else about your schema, I'd recommend an index with the following columns:
floordate, url, hour; include elapsed
Date range scans are generally more selective than a NULL/NOT NULL test. Moving Floordate to the front may make this index more desirable for this query. If SQL determines the query is good for Floordate and URL, the Hour column can be used for the Group By action. Since Elapsed is included, this index can cover the query completely.
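As a sketch, the recommended index could be created like this. The index name is made up for illustration, and the statement assumes url is narrow enough to be an index key column:

```sql
-- Key columns in the recommended order; elapsed is carried only at the
-- leaf level so the query can be answered without touching the base table.
CREATE NONCLUSTERED INDEX IX_myTable_floordate_url_hour
ON myTable (floordate, [url], [hour])
INCLUDE (elapsed);
```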
You can include ClientID after hour to see if that helps your other query as well.
As long as an index contains all of the columns to resolve the query, it is a candidate for use, even if there is no filtering needed. Generally speaking, a non-clustered index is skinnier than the base table, requiring less IO than scanning the full width base table.
Imagine Foo table has non-clustered indexes on ColA and ColB
and NO Indexes on ColC, ColD
SELECT colA, colB
FROM Foo
takes about 30 seconds.
SELECT colA, colB, colC, colD
FROM Foo
takes about 2 minutes.
Foo table has more than 5 million rows.
Question:
Is it possible that including columns that are not part of the indexes can slow down the query?
If yes, why? Aren't they part of the pages already read?
If you write a query that uses a covering index, then the full data pages in the heap/clustered index are not accessed.
If you subsequently add more columns to the query, such that the index is no longer covering, then either additional lookups will occur (if the index is still used), or you force a different data access path entirely (such as using a table scan instead of using an index)
Since 2005, SQL Server has supported the concept of Included Columns in an index. This includes non-key columns in the leaf of an index - so they're of no use during the data-lookup phase of index usage, but still help to avoid performing an additional lookup back in the heap/clustered index, if they're sufficient to make the index a covering index.
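For the Foo table in the question, that could look like the sketch below. The index name is hypothetical, and whether the wider index is worth its extra storage and write cost depends on the workload:

```sql
-- ColA is the key; ColB, ColC and ColD are stored only in the leaf pages,
-- which makes the index covering for the four-column query.
CREATE NONCLUSTERED INDEX IX_Foo_ColA_Covering
ON Foo (ColA)
INCLUDE (ColB, ColC, ColD);

-- This query can now be satisfied from the index alone.
SELECT colA, colB, colC, colD
FROM Foo;
```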
Also, in future, if you want to get a better understanding on why one query is fast and another is slow, look into generating Execution Plans, which you can then compare.
Even if you don't understand the terms used, you should at least be able to play "spot the difference" between them and then search on the terms (such as table scan, index seek, or lookup)
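One way to get a plan without a graphical tool is to ask SQL Server to return it as text. A minimal sketch, assuming the login has the SHOWPLAN permission and a client such as SSMS or sqlcmd that understands the GO batch separator; while the setting is on, statements are compiled but not executed:

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO

-- Not executed; the estimated plan is returned instead.
SELECT colA, colB
FROM Foo;

SELECT colA, colB, colC, colD
FROM Foo;
GO

SET SHOWPLAN_TEXT OFF;
GO
```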
The simple answer is: because a non-clustered index is not stored in the same pages as the data, SQL Server has to look up the actual data pages to pick up the rest.
Non-clustered indexes are stored in separate data structures, while a clustered index is stored in the same place as the actual data. That’s why you can have only one clustered index.
I have an update query that runs slowly (see the first query below). I have an index on the table PhoneStatus, column PhoneID, named IX_PhoneStatus_PhoneID. The table PhoneStatus contains 20 million records. When I run the following query, the index is not used and a Clustered Index Scan is used instead, so the update runs slowly.
UPDATE PhoneStatus
SET RecordEndDate = GETDATE()
WHERE PhoneID = 126
If I execute the following query, which includes the new FROM, I still have the same problem with the index not used.
UPDATE PhoneStatus
SET RecordEndDate = GETDATE()
FROM Cust_Profile.PhoneStatus
WHERE PhoneID = 126
But if I add the HINT to force the use of the index on the FROM it works correctly, and an Index Seek is used.
UPDATE PhoneStatus
SET RecordEndDate = GETDATE()
FROM Cust_Profile.PhoneStatus WITH(INDEX(IX_PhoneStatus_PhoneID))
WHERE PhoneID = 126
Does anyone know why the first query would not use the Index?
Update
In the table of 20 million records, each phoneID could show up 10 times at the most
BarDev
How many distinct PhoneIDs are in the 20M table? If the condition where PhoneID=126 is not selective enough, you may be hitting the index tipping point. If this query and access condition is very frequent, PhoneID is a good candidate for a clustered index leftmost key.
Pablo is correct, SQL Server will use an index only if it thinks this will run the query more efficiently. But with 20 million rows it should have known to use the index. I would imagine that you simply need to update statistics on the database.
For more information, see http://msdn.microsoft.com/en-us/library/aa260645(SQL.80).aspx.
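A minimal sketch of what that could look like for the table in the question. FULLSCAN reads every row, which is the most accurate option but can be slow on 20 million rows; a sampled update may be enough:

```sql
-- Rebuild all statistics on the table from a full scan of its rows.
UPDATE STATISTICS Cust_Profile.PhoneStatus WITH FULLSCAN;

-- Inspect the histogram the optimizer uses for the PhoneID index.
DBCC SHOW_STATISTICS ('Cust_Profile.PhoneStatus', IX_PhoneStatus_PhoneID);
```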
Take a look at Is an index seek always better or faster than an index scan?
Sometimes a seek and a scan will be the exact same.
An index might be disregarded because your stats could be stale or the selectivity of the index is so low that SQL Server thinks a scan will be better
Turn on stats and see if there are any differences between the query with and without a seek
SET STATISTICS io ON
UPDATE PhoneStatus
SET RecordEndDate = GETDATE()
WHERE PhoneID = 126
UPDATE PhoneStatus
SET RecordEndDate = GETDATE()
FROM Cust_Profile.PhoneStatus WITH(INDEX(IX_PhoneStatus_PhoneID))
WHERE PhoneID = 126
Now look at the reads that came back
SQL Server (or any other SQL database server, for that matter) is not forced to use any index at all. It will use one if it thinks it will help run the query more efficiently.
So, in your case, SQL Server thinks that it doesn't need to use IX_PhoneStatus_PhoneID and that using the clustered index might give better results. It might be wrong, though; that's what index hints are for: letting the server know it would do a better job by using another index.
If your table was recently created and populated, it might be the case that statistics are somewhat outdated. So you might want to force a statistic update.
To restate:
You have table PhoneStatus
With a clustered index
And a non-clustered index on columns PhoneStatus and PhoneId, in that order
You are issuing an update with "...WHERE PhoneId = 126"
There are 20 million rows in the table (i.e. it's big and then some)
SQL will take your query and try to figure out how to do the work without working over
the whole table. For your non-clustered index, the data might look like:
PhoneStatus  PhoneID
A            124
A            125
A            126
B            127
C            128
C            129
C            130
etc.
The thing is, SQL will check the first column first, before it checks the value of the second column. As the first column is not specified in the update, SQL cannot "shortcut" through the index search tree to the relevant entries, and so will have to scan the entire table. (No, SQL is not clever enough to say "eh, I'll just check the second column first", and yes, they're right to have done it that way.)
Since the non-clustered index won't make the query faster, it defaults to a table scan, and since there is a clustered index, that means it instead becomes a clustered index scan. (If the clustered index is on PhoneId, then you'd have optimal performance on your query, but I'm guessing that's not the case here.)
When you use the hint, it forces the use of the non-clustered index, and that will be faster than the full table scan if the table has a lot more columns than the index (which essentially has only the two), because there'd be that much less data to sift through.
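The effect of column order can be sketched with two hypothetical indexes. The names are invented, and the first index mirrors this answer's assumption of a (PhoneStatus, PhoneID) key rather than anything confirmed in the question:

```sql
-- Composite index led by PhoneStatus: entries are ordered by PhoneStatus
-- first, so a predicate on PhoneID alone cannot seek into it.
CREATE NONCLUSTERED INDEX IX_Example_Status_PhoneID
ON Cust_Profile.PhoneStatus (PhoneStatus, PhoneID);

-- Index led by PhoneID: WHERE PhoneID = 126 can seek directly
-- to the matching entries.
CREATE NONCLUSTERED INDEX IX_Example_PhoneID
ON Cust_Profile.PhoneStatus (PhoneID);
```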
In this blog post, I need clarification why SQL server would choose a particular type of scan:
Let’s assume for simplicities sake that col1 is unique and is ever increasing in value, col2 has 1000 distinct values and there are 10,000,000 rows in the table, and that the clustered index consists of col1, and a nonclustered index exists on col2.
Imagine the query execution plan created for the following initially passed parameters: #P1= 1 #P2=99
These values would result in an optimal queryplan for the following statement using the substituted parameters:
Select * from t where col1 > 1 or col2 > 99 order by col1;
Now, imagine the query execution plan if the initial parameter values were: #P1 = 6,000,000 and #P2 = 550.
As before, an optimal queryplan would be created after substituting the passed parameters:
Select * from t where col1 > 6000000 or col2 > 550 order by col1;
These two identical parameterized SQL Statements would potentially create and cache very different execution plans due to the difference of the initially passed parameter values. However, since SQL Server only caches one execution plan per query, chances are very high that in the first case the query execution plan will utilize a clustered index scan because of the ‘col1 > 1’ parameter substitution. Whereas, in the second case a query execution plan using index seek would most likely be created.
from: http://blogs.msdn.com/sqlprogrammability/archive/2008/11/26/optimize-for-unknown-a-little-known-sql-server-2008-feature.aspx
Why would the first query use a clustered index scan, and the second query an index seek?
Assuming that the columns contain only positive integers:
SQL Server would look at the statistics for the table and see that, for the first query, all rows in the table meet the criteria of col1>1, so it chooses to scan the clustered index.
For the second query, a relatively small proportion of rows would meet the criteria of col1> 6000000, so using an index seek would improve performance.
Notice that in both cases the clustered index will be used. In the first example it is a clustered index SCAN, whereas in the second example it will be a clustered index SEEK, which in most cases will be the faster, as the author of the blog states.
SQL Server knows that the clustered index is increasing. Therefore it will do a clustered index scan in the first case.
In cases where the optimizer sees that the majority of the table will be returned in the query, such as the first query, it's more efficient to perform a scan than a seek.
Where only a small portion of the table will be returned, such as in the second query, then an index seek is more efficient.
A scan will touch every row in the table whether it qualifies or not. The cost is proportional to the total number of rows in the table. A scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.
A seek will touch rows that qualify and pages that contain these qualifying rows; the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.
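The parameterized scenario from the quoted post can be reproduced with sp_executesql. The sketch below assumes the table t described in the quote, with integer columns, and adds OPTION (RECOMPILE), which is not part of the quoted example, so that each call is compiled for its own parameter values rather than reusing the cached plan:

```sql
DECLARE @stmt nvarchar(200) =
    N'SELECT * FROM t WHERE col1 > @P1 OR col2 > @P2 ORDER BY col1 OPTION (RECOMPILE);';
DECLARE @params nvarchar(100) = N'@P1 int, @P2 int';

-- Nearly every row qualifies: a clustered index scan is the likely choice.
EXEC sp_executesql @stmt, @params, @P1 = 1, @P2 = 99;

-- Fewer rows qualify, so the optimizer may pick a different plan.
EXEC sp_executesql @stmt, @params, @P1 = 6000000, @P2 = 550;
```

Removing the hint makes the second call reuse whatever plan the first call cached, which is the situation the post describes.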