I'm exploring ways to improve the performance of an application that I can only affect at the database level, and only to a limited degree. The SQL Server version is 2012 SP2, and the table and view structure in question is as follows (I cannot really change this; note that the XML document may have several hundred elements in total):
CREATE TABLE Orders(
id nvarchar(64) NOT NULL,
xmldoc xml NULL,
CONSTRAINT PK_Order_id PRIMARY KEY CLUSTERED (id)
);
CREATE VIEW V_Orders as
SELECT
a.id, a.xmldoc
,a.xmldoc.value('data(/row/c1)[1]', 'nvarchar(max)') "Stuff"
,a.xmldoc.value('data(/row/c2)[1]', 'nvarchar(max)') "OrderType"
etc..... many columns
from Orders a;
A typical query (and the one being used for testing below):
SELECT id FROM V_Orders WHERE OrderType = '30791'
All the queries are performed against the view and I can affect neither the queries nor the table/view structure.
I thought adding a selective XML index to the table would be my saviour:
CREATE SELECTIVE XML INDEX I_Orders_OrderType ON Orders(xmldoc)
FOR(
pathOrderType = '/row/c2' as SQL [nvarchar](20)
)
But even after updating statistics, the execution plan looks weird. I couldn't post a picture as this is a new account, so here are the relevant details as text:
Clustered index seek from selectiveXml (cost: 2% of total). Estimated number of rows 1, but estimated number of executions 1269 (the number of rows in the table)
-> Top N Sort (cost: 95% of total)
-> Compute Scalar (cost: 0)
Separate branch: Clustered index scan on PK_Order_id (cost: 3% of total). Estimated number of rows 1269
-> Merged with the Compute Scalar results via Nested Loops (left outer join)
-> Filter
-> Final result (estimated number of rows 1269)
In actuality, with my test data the query doesn't return any results, but whether it returns none or a few makes no difference. Execution times confirm that the query really takes as long as the execution plan suggests, and read counts are in the thousands.
So my question is: why is the selective XML index not being used properly by the optimizer? Or have I got something wrong? How would I optimize this specific query's performance with selective XML indexing (or perhaps a persisted column)?
Edit:
I did additional testing with larger sample data (~274k rows in the table, with XML documents close to average production size) and compared the selective XML index to a promoted column. The results are from a Profiler trace, concentrating on CPU usage and read counts. The execution plan for the selective XML index is basically identical to the one described above.
Selective XML index and 274k rows (executing the query above):
CPU: 6454, reads: 938521
After I updated the values in the searched field to be unique (total records still 274k) I got the following results:
Selective XML index and 274k rows (executing the query above):
CPU: 10077, reads: 1006466
Then using a promoted (i.e. persisted) separately indexed column and using it directly in the view:
CPU: 0, reads: 23
Selective XML index performance seems closer to a full table scan than to a proper fetch via an indexed SQL column. I read somewhere that using an XML schema collection for the column might help drop the Top N Sort step from the execution plan (assuming we're searching for a non-repeating field), but I'm not sure whether that's a realistic possibility in this case.
The selective XML index you create is stored in an internal table, with the primary key from Orders as the leading column of the internal table's clustered key, and the specified paths stored as sparse columns.
The query plan you get probably looks something like this:
You have a scan over the entire Orders table with a seek in the internal table on the primary key for each row in Orders. The final Filter operator is responsible for checking the value of OrderType returning only the matching rows.
Not really what you would expect from something called an index.
To the rescue comes a secondary selective XML index. A secondary index is created for one of the paths specified in the primary selective index and builds a non-clustered key on the values extracted by that path expression.
It is, however, not all that easy. SQL Server will not use the secondary index for predicates on values extracted by the value() function; you have to use exist() instead. Also, exist() requires the use of XQuery data types in the path expressions, where value() uses SQL data types.
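For instance, the view in the question extracts OrderType with value(), so the test query effectively becomes the following, which a secondary index cannot help:

select id
from Orders
where xmldoc.value('data(/row/c2)[1]', 'nvarchar(max)') = '30791'

The exist() form of the same predicate appears at the end of this answer.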
Your primary selective XML index could look like this:
CREATE SELECTIVE XML INDEX I_Orders_OrderType ON Orders(xmldoc)
FOR
(
pathOrderType = '/row/c2' as sql nvarchar(20),
pathOrderTypeX = '/row/c2/text()' as xquery 'xs:string' maxlength (20)
)
With a secondary index on pathOrderTypeX:
CREATE XML INDEX I_Orders_OrderType2 ON Orders(xmldoc)
USING XML INDEX I_Orders_OrderType FOR (pathOrderTypeX)
And with a query that uses exist(), you will get a much better plan:
select id
from V_Orders
where xmldoc.exist('/row/c2/text()[. = "30791"]') = 1
The first seek is a seek for the value you are looking for in the non-clustered index of the internal table. The key lookup is done on the clustered key of the internal table (I don't know why that is necessary). The last seek is on the primary key in the Orders table, followed by a filter that checks for null values in the column xmldoc.
If you can get away with using property promotion, creating calculated indexed columns in the Orders table from the XML, I guess you would still get better performance than using secondary selective XML indexes.
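For reference, a minimal sketch of that promotion pattern, assuming the Orders table can be altered (the function, column, and index names here are mine). The xml data type methods can't be used directly in a computed column definition, so the extraction is wrapped in a schema-bound function; the computed column can then be indexed because the function is schema-bound and deterministic:

create function dbo.fn_OrderType (@xmldoc xml)
returns nvarchar(20)
with schemabinding
as
begin
    -- extract the same path the view reads
    return @xmldoc.value('(/row/c2/text())[1]', 'nvarchar(20)');
end;
go

alter table Orders add OrderType_promoted as dbo.fn_OrderType(xmldoc);
create index I_Orders_OrderType_promoted on Orders(OrderType_promoted);

The view would then read the promoted column directly instead of calling value() per row, which matches the numbers in the question's edit.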
Related
I have a table that looks something like this:
CREATE TABLE Records
(
ID UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED,
owner UNIQUEIDENTIFIER,
value FLOAT,
timestamp DATETIME
)
There is a multi-column clustered index on some other columns not relevant to this question.
The table currently has about 500,000,000 rows and I need to operate on it, but it's too large to deal with in one go (I am hampered by slow hardware), so I decided to work on it in chunks.
But if I say
SELECT ID
FROM records
WHERE ID LIKE '0000%'
The execution plan shows that the ENTIRE TABLE is scanned. I thought that with an index, only those rows that matched the original condition would be scanned until SQL reached the '0001' records. With the % in front, I could clearly see why it would scan the whole table. But with the % at the end, it shouldn't have to scan the whole table.
I am guessing this works differently with GUIDs than with CHAR or VARCHAR columns.
So my question is this: how can I search for a subsection of GUIDs without having to scan the whole table?
From your comments, I see the actual need is to break the rows of random GUID values into chunks (ordered) based on range. In this case, you can specify a range instead of LIKE along with a filter on the desired start/end values in the last group:
SELECT ID
FROM dbo.records
WHERE
ID BETWEEN '00000000-0000-0000-0000-000000000000'
AND '00000000-0000-0000-0000-000FFFFFFFFF';
This article explains how uniqueidentifiers (GUIDs) are stored and ordered in SQL Server, comparing and sorting the last group first rather than left-to-right as you might expect. By filtering on the last group, you'll get a sargable expression and touch only those rows in the specified range (assuming an index on ID is used).
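A quick way to convince yourself of that ordering (the two values here are arbitrary):

declare @a uniqueidentifier = '00000000-0000-0000-0000-010000000000';
declare @b uniqueidentifier = 'FFFFFFFF-FFFF-FFFF-FFFF-000000000000';
-- the last group is compared first, so @a sorts after @b
-- even though it looks smaller as a string
select case when @a > @b then '@a > @b' else '@a <= @b' end;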
In my SQL Server database I have a table of Requests with requestID (int) as Identity, PK and Clustered index. There are approximately 30 other columns in the table.
I am using Entity Framework to access the DB.
There is a function called GetRequestByID(int requestID) that pulls all the columns from the Requests table and columns from related tables using inner joins.
Recently, to reduce the amount of data pulled where it isn't needed, I created two additional functions, GetRequestByID_Lite and GetRequestByID_EvenLiter, that return a smaller number of columns, and replaced all the relevant calls in the code.
For each of those functions I created a corresponding non-clustered index by requestID and including only the columns each function needs.
After one hour, the first thing I saw was that the memory consumed by the process had decreased dramatically.
When I queried sys.dm_db_index_usage_stats, I saw the following for the new indexes:
_index_for_GetRequestByID_Lite - 0 seeks, 422 scans, 0 lookups, 49 updates
_index_for_GetRequestByID_EvenLiter - 0 seeks, 0 scans, 0 lookups, 51 updates
My question is why so many scans and no seeks for _index_for_GetRequestByID_Lite?
If the index doesn't contain all the columns required, then why doesn't SQL Server just use the clustered index?
And why is _index_for_GetRequestByID_EvenLiter not being used at all (there is no doubt that GetRequestByID_EvenLiter is called a lot)?
Also, when I run a SQL query equivalent to GetRequestByID_EvenLiter, the clustered index is used in the execution plan instead of _index_for_GetRequestByID_EvenLiter.
Thank You.
SQL Server might not have found your index effective in terms of cost.
See the example below:
create table test
(
    col1 int primary key,
    col2 int,
    col3 int,
    col4 varchar(10),
    col5 datetime
);

-- "numbers" is assumed to be a helper table of sequential integers
insert into test
select number, number + 1, number + 2, number + 5, dateadd(day, number, getdate())
from numbers;
Let's create an index
create index nc_Col2 on test(col2)
include(Col3,col4)
Now let's run a query like the one below and look at the execution plan cost:

select * from test
where col2 > 4;

You might have thought SQL Server would use the above index, but it didn't. Now let's observe the cost when we force SQL Server to use that index:
select * from test with (index (nc_col2))
where col2>4
In summary, the reasons your index might not be used are:
It is not cost-effective compared to other existing possibilities.
The index does not cover the query, as in my example (I am selecting * and the index contains only three columns); see the covering query below.
There are also some further concepts involved, like allocation scans and sequential scans, but in short SQL Server has to believe your index costs less. Check out the links below to see how costing works.
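By contrast, here is a sketch of a query that the index nc_Col2 does cover; the optimizer should pick the index on its own, with no hint needed:

-- only col2, col3 and col4 are referenced, and all of them
-- are present in nc_Col2, so no lookups are required
select col2, col3, col4
from test
where col2 > 4;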
Further reading:
Inside the Optimizer: Plan Costing
https://dba.stackexchange.com/a/23716/31995
After running the following query:
SELECT [hour], count(*) as hits, avg(elapsed)
FROM myTable
WHERE [url] IS NOT NULL and floordate >= '2017-05-01'
group by [hour]
the execution plan is basically a clustered Index Scan on the PK (int, auto-increment, 97% of the work)
The thing is: url has an index on it (a regular index, because I'm always searching for an exact match), and floordate also has an index...
Why are they not being used? How can I speed up this query?
PS: the table is 70M rows and this query takes about 9 minutes to run
Edit 1
If I don't select or filter on a column of my index, will the index still be used? Usually I also filter by or group by clientId (approx. 300 unique values across the db) and hour (24 unique values)...
In this scenario, two things affect how SQL Server will choose an index.
How selective is the index. A higher selectivity is better. NULL/NOT NULL filters generally have a very low selectivity.
Whether all of the columns needed by the query are in the index, making it a covering index.
In your example, if the index cannot cover the query, SQL will have to look up the other column values against the base table. If your URL/Floordate combination is not selective enough, SQL may determine it is cheaper to scan the base table rather than do an expensive lookup from the non-clustered index to the base table for a large number of rows.
Without knowing anything else about your schema, I'd recommend an index with the following columns:
floordate, url, hour; include elapsed
Date range scans are generally more selective than a NULL/NOT NULL test. Moving floordate to the front may make this index more desirable for this query. If SQL determines the index is good for floordate and url, the hour column can be used for the GROUP BY. Since elapsed is included, this index can cover the query completely.
You can include ClientID after hour to see if that helps your other query as well.
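Expressed as DDL, the suggestion might look like this (the index name is made up):

-- covering index for the aggregate query; add clientId after [hour]
-- if it helps the other query too
CREATE NONCLUSTERED INDEX IX_myTable_floordate_url_hour
ON myTable (floordate, [url], [hour])
INCLUDE (elapsed);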
As long as an index contains all of the columns to resolve the query, it is a candidate for use, even if there is no filtering needed. Generally speaking, a non-clustered index is skinnier than the base table, requiring less IO than scanning the full width base table.
I'm in the process of trying to optimize a query that looks up historical data. I'm using the query analyzer to lookup the Execution Plan and have found that the majority of my query cost is on something called a "Bookmark Lookup". I've never seen this node in an execution plan before and don't know what it means.
Is this a good thing or a bad thing in a query?
A bookmark lookup is the process of finding the actual data in the SQL table, based on an entry found in a non-clustered index.
When you search for a value in a non-clustered index, and your query needs more fields than are part of the index leaf node (all the index fields, plus any possible INCLUDE columns), then SQL Server needs to go retrieve the actual data page(s) - that's what's called a bookmark lookup.
In some cases, that's really the only way to go. But if your query requires just one more field (not a whole bunch of them), it might be a good idea to INCLUDE that field in the non-clustered index. In that case, the leaf-level node of the non-clustered index would contain all the fields needed to satisfy your query (a "covering" index), and thus a bookmark lookup wouldn't be necessary anymore.
Marc
It's a NESTED LOOP which joins a non-clustered index with the table itself on a row pointer.
It happens for queries like this, when you have an index on col2:
SELECT col1
FROM table
WHERE col2 BETWEEN 1 AND 10
The index on col2 contains pointers to the indexed rows.
So, in order to retrieve the value of col1, the engine needs to scan the index on col2 for the key values from 1 to 10, and for each index leaf, refer to the table itself using the pointer contained in the leaf, to find out the value of col1.
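For the example above, making the index cover col1 would remove the lookup entirely; a sketch using the same illustrative names:

-- with col1 included, the index leaf holds everything the query
-- needs, so no lookup back into the table is required
CREATE INDEX ix_col2_covering ON [table] (col2) INCLUDE (col1);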
This article points out that "Bookmark Lookup" is SQL Server 2000's term; in SQL Server 2005 and above, it is replaced by a nested loop between the index and the table (shown as a Key Lookup or RID Lookup).
From MSDN regarding Bookmark Lookups:
The Bookmark Lookup operator uses a bookmark (row ID or clustering key) to look up the corresponding row in the table or clustered index. The Argument column contains the bookmark label used to look up the row in the table or clustered index. The Argument column also contains the name of the table or clustered index in which the row is looked up. If the WITH PREFETCH clause appears in the Argument column, the query processor has determined that it is optimal to use asynchronous prefetching (read-ahead) when looking up bookmarks in the table or clustered index.
Regarding this blog post, I need clarification on why SQL Server would choose a particular type of scan:
Let's assume for simplicity's sake that col1 is unique and ever increasing in value, col2 has 1000 distinct values, there are 10,000,000 rows in the table, the clustered index consists of col1, and a nonclustered index exists on col2.

Imagine the query execution plan created for the following initially passed parameters: @P1 = 1, @P2 = 99. These values would result in an optimal query plan for the following statement using the substituted parameters:

Select * from t where col1 > 1 or col2 > 99 order by col1;

Now, imagine the query execution plan if the initial parameter values were @P1 = 6,000,000 and @P2 = 550. As before, an optimal query plan would be created after substituting the passed parameters:

Select * from t where col1 > 6000000 or col2 > 550 order by col1;

These two identical parameterized SQL statements would potentially create and cache very different execution plans due to the difference of the initially passed parameter values. However, since SQL Server only caches one execution plan per query, chances are very high that in the first case the query execution plan will utilize a clustered index scan because of the 'col1 > 1' parameter substitution. Whereas in the second case, a query execution plan using an index seek would most likely be created.
from: http://blogs.msdn.com/sqlprogrammability/archive/2008/11/26/optimize-for-unknown-a-little-known-sql-server-2008-feature.aspx
Why would the first query use a clustered index scan, and the second an index seek?
Assuming that the columns contain only positive integers:
SQL Server would look at the statistics for the table and see that, for the first query, all rows in the table meet the criteria of col1>1, so it chooses to scan the clustered index.
For the second query, a relatively small proportion of rows would meet the criteria of col1> 6000000, so using an index seek would improve performance.
Notice that in both cases the clustered index will be used. In the first example it is a clustered index SCAN, whereas in the second example it will be a clustered index SEEK, which in most cases will be faster, as the author of the blog states.
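If you want to inspect the statistics the optimizer bases this choice on, you can look at the density and histogram for the index on col2 (the index name here is hypothetical, since the quoted post doesn't name it):

-- shows density information and the histogram for the index IX_col2 on t
DBCC SHOW_STATISTICS ('t', 'IX_col2');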
SQL Server knows that the clustered index is increasing. Therefore it will do a clustered index scan in the first case.
In cases where the optimizer sees that the majority of the table will be returned by the query, as in the first query, it's more efficient to perform a scan than a seek.
Where only a small portion of the table will be returned, such as in the second query, then an index seek is more efficient.
A scan will touch every row in the table whether it qualifies or not. The cost is proportional to the total number of rows in the table. A scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.
A seek will touch only rows that qualify and pages that contain these qualifying rows; the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.
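Since the quoted post is about the OPTIMIZE FOR UNKNOWN hint, here is a sketch of how it would apply to the parameterized statement (using the blog's table and column names). With the hint, the optimizer costs the plan from average density statistics instead of the first sniffed parameter values, so the single cached plan no longer depends on which values arrived first:

SELECT *
FROM t
WHERE col1 > @P1 OR col2 > @P2
ORDER BY col1
OPTION (OPTIMIZE FOR UNKNOWN);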