In my SQL Server database I have a table of Requests with requestID (int) as Identity, PK and Clustered index. There are approximately 30 other columns in the table.
I am using Entity Framework to access the DB.
There is a function called GetRequestByID(int requestID) that pulls all the columns from the Requests table and columns from related tables using inner joins.
Recently, to reduce the amount of data pulled where not needed, I created two additional functions, GetRequestByID_Lite and GetRequestByID_EvenLiter that return lesser number of columns, and replaced all the relevant calls in the code.
For each of those functions I created a corresponding non-clustered index by requestID and including only the columns each function needs.
After one hour, first thing I see is that the memory consumed by the process decreased dramatically.
When I ran SYS.DM_DB_INDEX_USAGE_STATS, I see the following for the new indexes:
_index_for_GetRequestByID_Lite - 0 seeks, 422 scans, 0 lookups, 49 updates
_index_for_GetRequestByID_EvenLiter - 0 seeks, 0 scans, 0 lookups, 51 updates
My question is why so many scans and no seeks for _index_for_GetRequestByID_Lite?
If the index doesn't contain all the columns required, then why doesn't SQL Server just use the clustered index?
And why _index_for_GetRequestByID_EvenLiter is not being used at all (there is no doubt the function GetRequestByID_EvenLiter is called a lot)?
Also, when I run an SQL query equivalent to GetRequestByID_EvenLiter, the Clustered index is used in execution plan instead of _index_for_GetRequestByID_EvenLiter.
Thank You.
SQLServer might not have found your index effective in terms of cost.
see below example
create table
test
(
col1 int primary key,
col2 int,
col3 int,
col4 varchar(10),
col5 datetime
)
insert into test
select number,number+1,number+2,number+5,dateadd(day,number,getdate())
from numbers
Let's create an index
create index nc_Col2 on test(col2)
include(Col3,col4)
Now if we run a query like below
select * from test
where col2>4
and see execution plan cost...
You might have thought sqlserver should have used above index,but it didn't.Now let's observe the cost when we force sqlserver to use that index
select * from test with (index (nc_col2))
where col2>4
In summary ,the reason being your index might not be used may be due to
It is not cost effective compared to other existing possibilties
your index is not efficient as shown in my example( i am selecting * and index has only three columns)
also there are some more concepts like allocation scan,sequential scan,but in summary SQL has to believe your index costs less.Check out below links to see how to improve costing
Further reading:
Inside the Optimizer: Plan Costing
https://dba.stackexchange.com/a/23716/31995
Related
My question is about performance on SQL server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has 2 columns, and doing a SELECT * ...
So basically, will the existence of the extra columns slow down the select query event if I am not retrieving the data from the extra columns?
There will be minor difference on the very end of the process as you don't have to print/pass the rest of information for the end client (either SSMS or other app).
When performing a read based on clustered index all of the column (without BLOB) are saved on the same page set so to read the data you have to access the same set of pages anyway.
You would see a performance increase if you would have a nonclustered index on the column list you are after as then they are saved in their own structure of data pages (so it would be less to read).
Assuming that you are using the default Clustered Index created by SQL server when defining the primary key on the table in both scenarios then no, there shouldn't be any performance difference between these two scenarios. Maybe worth just checking it out and generating an Actual Execution plan to see for yourself? -- Actually not sure above is true, as given this is rowstore, the first table wont be able to fit as many rows onto each page so will suffer more of an IO/Disk overhead when reading data.
After running the following query:
SELECT [hour], count(*) as hits, avg(elapsed)
FROM myTable
WHERE [url] IS NOT NULL and floordate >= '2017-05-01'
group by [hour]
the execution plan is basically a clustered Index Scan on the PK (int, auto-increment, 97% of the work)
The thing is: URL has a index on it (regular index because i'm always searching for a exact match), floordate also has an index...
Why are they not being used? How can i speed up this query?
PS: table is 70M items long and this query takes about 9 min to run
Edit 1
If i don't use (select or filter) a column on my index, will it still be used? Usually i also filter-for/group-by clientId (approx 300 unique across the db) and hour (24 unique)...
In this scenario, two things affect how SQL Server will choose an index.
How selective is the index. A higher selectivity is better. NULL/NOT NULL filters generally have a very low selectivity.
Are all of the columns in the index, also known as a covering index.
In your example, if the index cannot cover the query, SQL will have to look up the other column values against the base table. If your URL/Floordate combination is not selective enough, SQL may determine it is cheaper to scan the base table rather than do an expensive lookup from the non-clustered index to the base table for a large number of rows.
Without knowing anything else about your schema, I'd recommend an index with the following columns:
floordate, url, hour; include elapsed
Date ranges scans are generally more selective than a NULL/NOT NULL test. Moving Floordate to the front may make this index more desirable for this query. If SQL determines the query is good for Floordate and URL, the Hour column can be used for the Group By action. Since Elapsed is included, this index can cover the query completely.
You can include ClientID after hour to see if that helps your other query as well.
As long as an index contains all of the columns to resolve the query, it is a candidate for use, even if there is no filtering needed. Generally speaking, a non-clustered index is skinnier than the base table, requiring less IO than scanning the full width base table.
I'm exploring ways of improving the performance of an application which I can only affect on the database level to a limited degree. The SQL Server version is 2012 SP2 and the table and view structure in question is (I cannot really affect this + note that the xml document may have several hundred elements in total):
CREATE TABLE Orders(
id nvarchar(64) NOT NULL,
xmldoc xml NULL,
CONSTRAINT PK_Order_id PRIMARY KEY CLUSTERED (id)
);
CREATE VIEW V_Orders as
SELECT
a.id, a.xmldoc
,a.xmldoc.value('data(/row/c1)[1]', 'nvarchar(max)') "Stuff"
,a.xmldoc.value('data(/row/c2)[1]', 'nvarchar(max)') "OrderType"
etc..... many columns
from Orders a;
A typical query (and the one being used for testing below):
SELECT id FROM V_Orders WHERE OrderType = '30791'
All the queries are performed against the view and I can affect neither the queries nor the table/view structure.
I thought adding a selective XML index to the table would be my saviour:
CREATE SELECTIVE XML INDEX I_Orders_OrderType ON Orders(xmldoc)
FOR(
pathOrderType = '/row/c2' as SQL [nvarchar](20)
)
But even after updating the statistics the execution plan is looking weird. Couldn't post a pic as new account so the relevant details as text:
Clustered index seek from selectiveXml (Cost: 2% of total). Expected number of rows 1 but expected number of execution times 1269 (number of rows in the table)
-> Top N sort (Cost: 95% of total)
-> Compute scalar (Cost 0)
Separate branch: Clustered index scan PK_Order_id (Cost: 3% of total). Expected number of rows 1269
-> Merged to the Computer scalar results with Nested loops (Left outer join)
-> Filter
-> Final result (Expected number of rows 1269)
In actuality with my test data the query doesn't even return any results but whether it returns one or few doesn't make any difference. Execution times support the query really taking as long as could be deduced from the execution plan and have read counts in the thousands.
So my question is why is the selective xml index not being used properly by the optimizer? Or have I got something wrong? How would I optimize this specific query's performance with selective xml indexing (or perhaps persisted column)?
Edit:
I did additional testing with larger sample data (~274k rows in the table with XML documents close to average production sizes) and compared the selective XML index to a promoted column. The results are from Profiler trace, concentrating on CPU usage and read counts. The execution plan for selective xml indexing is basically identical to what is described above.
Selective XML index and 274k rows (executing the query above):
CPU: 6454, reads: 938521
After I updated the values in the searched field to be unique (total records still 274k) I got the following results:
Selective XML index and 274k rows (executing the query above):
CPU: 10077, reads: 1006466
Then using a promoted (i.e. persisted) separately indexed column and using it directly in the view:
CPU: 0, reads: 23
Selective XML index performance seems to be closer to full table scan than proper SQL indexed column fetch. I read somewhere that using schema for the table might help drop the TOP N step from execution plan (assuming we're searching for a non-repeating field) but I'm not sure whether that's a realistic possibility in this case.
The selective XML index you create is stored in an internal table with the primary key from Orders as the leading column for the clustered key for the internal table and the paths specified stored as sparse columns.
The query plan you get probably looks a something like this:
You have a scan over the entire Orders table with a seek in the internal table on the primary key for each row in Orders. The final Filter operator is responsible for checking the value of OrderType returning only the matching rows.
Not really what you would expect from something called an index.
To the rescue comes a secondary selective XML index. They are created for one of the paths specified in the primary selective index and will create a non-clustered key on the values extracted in the path expression.
It is however not all that easy. SQL Server will not use the secondary index on predicates used on values extracted by the values() function. You have to use exists() instead. Also, exists() requires the use of XQUERY data types in the path expressions where value() uses SQL data types.
Your primary selective XML index could look like this:
CREATE SELECTIVE XML INDEX I_Orders_OrderType ON Orders(xmldoc)
FOR
(
pathOrderType = '/row/c2' as sql nvarchar(20),
pathOrderTypeX = '/row/c2/text()' as xquery 'xs:string' maxlength (20)
)
With a secondary on pathOrderTypeX.
CREATE XML INDEX I_Orders_OrderType2 ON Orders(xmldoc)
USING XML INDEX I_Orders_OrderType FOR (pathOrderTypeX)
And with a query that uses exist() you will get this plan.
select id
from V_Orders
where xmldoc.exist('/row/c2/text()[. = "30791"]') = 1
The first seek is a seek for the value you are looking for in the non-clustered index of the internal table. The key lookup is done on the clustered key on the internal table (don't know why that is necessary). And the last seek is on primary key in the Orders table followed by a filter that checks for null values in the column xmldoc.
If you can get away with using property promotion, creating calculated indexed columns in the Orders table from the XML, I guess you would still get better performance than using secondary selective XML indexes.
I have a table MY_TABLE with approximately 9 million rows.
There are in total of 38 columns in this table. The columns that are relevant to my question are:
RECORD_ID: identity, bigint, with unique clustered index
RECORD_CREATED: datetime, with non-unique & non-clustered index
Now I run the following two queries and naturally expect the first one to execute faster because the data is being sorted by a column that has a unique clustered index but somehow it executes 271 times(!) slower.
SELECT TOP 1
RECORD_ID
FROM
MY_TABLE
WHERE
RECORD_CREATED >= '20140801'
ORDER BY
RECORD_ID
SELECT TOP 1
RECORD_ID
FROM
MY_TABLE
WHERE
RECORD_CREATED >= '20140801'
ORDER BY
RECORD_CREATED
The execution times are 1630 ms and 6 ms, respectively.
Please advise.
P.S.: Due to security policies of the environment I cannot see the execution plan or use SQL Profiler.
SQL Server has a few choices to make about how to perform this query. It could begin by sorting all the items, leveraging the indexes you mentioned, and then follow that up by filtering out any items that don't match the WHERE clause. However, it's typically faster to cut down on the size of the data set that you're working with first, so you don't have to sort as many items.
So SQL Server is most-likely choosing to perform the WHERE filter first. When it does this, it most likely starts by using the non-unique, non-clustered index on RECORD_CREATED to skip over all the items where RECORD_CREATED is less than '20140801', and then take all the items after that.
At that point, all the items are pre-sorted in the order in which they were found in the RECORD_CREATED index, so the second query requires no additional effort, but the first query then has to perform a sort on the records that have been chosen.
In this blog post, I need clarification why SQL server would choose a particular type of scan:
Let’s assume for simplicities sake
that col1 is unique and is ever
increasing in value, col2 has 1000
distinct values and there are
10,000,000 rows in the table, and that
the clustered index consists of col1,
and a nonclustered index exists on
col2.
Imagine the query execution plan
created for the following initially
passed parameters: #P1= 1 #P2=99
These values would result in an
optimal queryplan for the following
statement using the substituted
parameters:
Select * from t where col1 > 1 or col2
99 order by col1;
Now, imagine the query execution plan
if the initial parameter values were:
#P1 = 6,000,000 and #P2 = 550.
As before, an optimal queryplan would
be created after substituting the
passed parameters:
Select * from t where col1 > 6000000
or col2 > 550 order by col1;
These two identical parameterized SQL
Statements would potentially create
and cache very different execution
plans due to the difference of the
initially passed parameter values.
However, since SQL Server only caches
one execution plan per query, chances
are very high that in the first case
the query execution plan will utilize
a clustered index scan because of the
‘col1 > 1’ parameter substitution.
Whereas, in the second case a query
execution plan using index seek would
most likely be created.
from: http://blogs.msdn.com/sqlprogrammability/archive/2008/11/26/optimize-for-unknown-a-little-known-sql-server-2008-feature.aspx
Why would the first query use a clustered index, and a index seek in the second query?
Assuming that the columns contain only positive integers:
SQL Server would look at the statistics for the table and see that, for the first query, all rows in the table meet the criteria of col1>1, so it chooses to scan the clustered index.
For the second query, a relatively small proportion of rows would meet the criteria of col1> 6000000, so using an index seek would improve performance.
Notice that in both cases the clustered index will be used. In the first example it is a clustered index SCAN where as in the second example it will be a clustered index SEEK which in most cases will be the faster as the author of the blog states.
SQL Server knows that the clustered index is increasing. Therefore it will do a clustered index scan in the first case.
In cases where the optimizer sees that the majority of the table will be returned in the query, such as the first query, then it's more efficient to perform a scan then a seek.
Where only a small portion of the table will be returned, such as in the second query, then an index seek is more efficient.
A scan will touch every row in the table whether it qualifies or not. The cost is proportional to the total number of rows in the table. A scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.
A seek will touch rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.