I'm trying to optimize a query that looks up historical data. I'm using the query analyzer to look up the execution plan and have found that the majority of my query cost is on something called a "Bookmark Lookup". I've never seen this node in an execution plan before and don't know what it means.
Is this a good thing or a bad thing in a query?
A bookmark lookup is the process of finding the actual data in the SQL table, based on an entry found in a non-clustered index.
When you search for a value in a non-clustered index, and your query needs more fields than are part of the index leaf node (all the index fields, plus any possible INCLUDE columns), then SQL Server needs to go retrieve the actual data page(s) - that's what's called a bookmark lookup.
In some cases, that's really the only way to go. But if your query requires just one more field (not a whole bunch of 'em), it might be a good idea to INCLUDE that field in the non-clustered index. In that case, the leaf-level node of the non-clustered index would contain all fields needed to satisfy your query (a "covering" index), and thus a bookmark lookup wouldn't be necessary anymore.
Marc
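As a rough sketch of the INCLUDE approach described above: the table and column names below (dbo.PriceHistory, ProductID, PriceDate, Price) are hypothetical and only for illustration, and the INCLUDE clause assumes SQL Server 2005 or later.

```sql
-- Hypothetical history table, clustered on an identity column.
CREATE TABLE dbo.PriceHistory (
    PriceHistoryID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    ProductID      int      NOT NULL,
    PriceDate      datetime NOT NULL,
    Price          money    NOT NULL
);

-- Index on the search column only: the query below can seek on it,
-- but must then look up PriceDate and Price in the table for every match.
CREATE NONCLUSTERED INDEX IX_PriceHistory_ProductID
    ON dbo.PriceHistory (ProductID);

-- Covering alternative: the extra columns are stored at the leaf level,
-- so the same query can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_PriceHistory_ProductID_Covering
    ON dbo.PriceHistory (ProductID)
    INCLUDE (PriceDate, Price);

SELECT PriceDate, Price
FROM dbo.PriceHistory
WHERE ProductID = 42;
```

Whether the wider index is worth its extra storage and write cost depends on how often the query runs and how many rows it returns.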
It's a NESTED LOOP which joins a non-clustered index with the table itself on a row pointer.
It happens for queries like this:
SELECT col1
FROM table
WHERE col2 BETWEEN 1 AND 10
provided you have a non-clustered index on col2.
The index on col2 contains pointers to the indexed rows.
So, in order to retrieve the value of col1, the engine needs to scan the index on col2 for the key values from 1 to 10, and for each index leaf, refer to the table itself using the pointer contained in the leaf, to find out the value of col1.
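One way to see this operator for yourself is to ask for the estimated plan as text. This is only a sketch: dbo.SomeTable is a hypothetical stand-in for the table in the example above, and it is assumed to have a non-clustered index on col2.

```sql
-- GO is the batch separator understood by Query Analyzer, SSMS and sqlcmd;
-- it is not a T-SQL statement. SET SHOWPLAN_TEXT must be alone in its batch.
SET SHOWPLAN_TEXT ON;
GO

-- With SHOWPLAN_TEXT on, the query is not executed; the plan is returned instead.
-- Look for a lookup operator joined to the index seek by a nested loops join.
SELECT col1
FROM dbo.SomeTable
WHERE col2 BETWEEN 1 AND 10;
GO

SET SHOWPLAN_TEXT OFF;
GO
```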
This article points out that Bookmark Lookup is SQL Server 2000's term; in SQL Server 2005 and above it is replaced by a Nested Loops join between the index and the table.
From MSDN regarding Bookmark Lookups:
The Bookmark Lookup operator uses a bookmark (row ID or clustering key) to look up the corresponding row in the table or clustered index. The Argument column contains the bookmark label used to look up the row in the table or clustered index. The Argument column also contains the name of the table or clustered index in which the row is looked up. If the WITH PREFETCH clause appears in the Argument column, the query processor has determined that it is optimal to use asynchronous prefetching (read-ahead) when looking up bookmarks in the table or clustered index.
I have a table in SQL Server with a three-column clustered index.
I have a table with columns (CustomerID, A, ProductID, C, OtherID) and I have a clustered key on (OtherID, CustomerID, ProductID).
Is there a performance hit for that column order (in the table, not the index)? Or is there a hidden advantage to reordering the table so that the key columns are its first three columns: (OtherID, CustomerID, ProductID, A, C)?
Seems like it shouldn't be a big problem, but implementations can have hidden performance costs.
(I was looking for the cause of a performance issue we were having, and this was just one of those "It shouldn't be a problem, but maybe it could be a problem..." kind of guesses.)
I won't assume what type of clustered index we are talking about here, so I will try to cover all the basics. I would have to say that, logically, the impact (performance or otherwise) of the ordinal position of the columns within your table in relation to their ordinal position within the clustered index is inconsequential (unless someone out there has something to prove me wrong).
Rowstore
Keep in mind that your table data and rowstore clustered indexes end up becoming separate logical structures. Per Microsoft regarding the clustered rowstore index architecture:
indexes are organized as B-Trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom nodes in the index are called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows.
So when we are talking about the physical storage of both the clustered index and the table data, we can think of them as separate structures. The same page illustrates this with a diagram of the root, intermediate, and leaf levels of the B-tree.
All three of these levels have at least one thing in common. They are all storing values (more or less) logically sorted by the value of your clustered index. Regardless of the ordinal position of the columns within your table structure, the leaf pages for your table data will be stored logically ordered by the columns/values within your clustered index. This is also true of your intermediate pages, which represent the storage of your clustered index values.
All of that is to say: the ordinal position of your columns within the clustered index is what determines how both the intermediate-level and leaf pages are logically ordered. The ordinal position of those columns within your table definition has no impact on their storage order, because that order comes from the clustered index.
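To make the distinction concrete, here is a small sketch using the column names from the question. The table name, index name, and the int data types are assumptions, and the catalog views require SQL Server 2005 or later.

```sql
-- Table columns in the order given in the question (data types are assumed).
CREATE TABLE dbo.ColumnOrderDemo (
    CustomerID int NOT NULL,
    A          int NULL,
    ProductID  int NOT NULL,
    C          int NULL,
    OtherID    int NOT NULL
);

-- Clustered key in a different order than the table definition.
CREATE CLUSTERED INDEX CIX_ColumnOrderDemo
    ON dbo.ColumnOrderDemo (OtherID, CustomerID, ProductID);

-- column_id is the position in the table; key_ordinal is the position in the key.
-- The two are tracked independently of each other.
SELECT c.name, c.column_id, ic.key_ordinal
FROM sys.index_columns AS ic
JOIN sys.columns AS c
  ON c.object_id = ic.object_id
 AND c.column_id = ic.column_id
WHERE ic.object_id = OBJECT_ID(N'dbo.ColumnOrderDemo')
  AND ic.index_id = 1          -- a clustered index always has index_id 1
ORDER BY ic.key_ordinal;
```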
Columnstore
Regarding clustered columnstore indexes, I would again say that it has no impact, but for a different (and simpler) reason. The columnstore index breaks up the column values into separate logical structures, which have no relation to each other by way of their ordinal position. So regardless of a column's ordinal position within the table, when you query a value from a column you are querying the separate physical structure that represents that column's values (ignoring the deltastore for simplicity here). Similarly, when you query multiple columns' values, you are querying each individual structure that represents each column's values separately.
This is why you are not even able to specify a column list when creating a clustered columnstore index. The ordinal position of the columns within the columnstore index itself has no impact, so I'd imagine that the ordinal position of those columns within the table itself (or any relationship between the two) also has no impact.
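For illustration, this is roughly what creating one looks like. The table and index names are hypothetical, and clustered columnstore indexes assume SQL Server 2014 or later.

```sql
-- Hypothetical table reusing the column names from the question.
CREATE TABLE dbo.ColumnstoreDemo (
    CustomerID int NOT NULL,
    A          int NULL,
    ProductID  int NOT NULL,
    C          int NULL,
    OtherID    int NOT NULL
);

-- No column list here: a clustered columnstore index always covers
-- every column of the table, each stored in its own segments.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_ColumnstoreDemo
    ON dbo.ColumnstoreDemo;
```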
Heap
Lastly, should anyone else ask, even with tables stored as a heap I would still argue that the ordinal position of columns within the table has no impact on query performance. Under the hood, a heap is still a rowstore; it simply has no clustered index imposing an order on its rows.
Per Microsoft:
A rowstore is data that is logically organized as a table with rows and columns, and then physically stored in a row-wise data format. This has been the traditional way to store relational table data such as a heap or clustered B-tree index.
So a heap stores its rows row-wise, just like a table with a clustered index. The main difference is that the rows are kept in no particular order, and each row is identified by a physical locator rather than by a business value. As described by Microsoft:
If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).
This RID is not something you would ever normally use as a predicate in a query, which is the main disadvantage (since data is made to be queried, right?). But regardless, the ordinal position of the columns within your table still has no impact on how the rows are actually stored, so I can't imagine that it could impact your query performance.
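If you are unsure whether a table is a heap at all, the catalog will tell you. A minimal sketch, assuming SQL Server 2005 or later and a hypothetical table named dbo.HeapDemo:

```sql
-- No clustered index and no clustered primary key, so this table is a heap.
CREATE TABLE dbo.HeapDemo (
    CustomerID int NOT NULL,
    A          int NULL,
    OtherID    int NOT NULL
);

-- A heap is listed with index_id = 0 and type_desc = 'HEAP';
-- a table with a clustered index has index_id = 1 and type_desc = 'CLUSTERED'.
SELECT index_id, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.HeapDemo');
```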
I just created a table with TWO primary keys in SQL Server. One column is age, another is ID number and I set the option to CLUSTER INDEX, so it automatically creates a cluster index on both columns. However, when I query the table, the results only seem to sort the ID and completely disregard/ignore the AGE (other PK and other Cluster index column). Why is this? Why is it only sorting based on the first cluster index column?
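The query optimizer may decide to use the physical ordering of the rows in the table if there is no advantage in ordering any other way. So, when you select from the table using a simple query, it may be ordered this way, but without an ORDER BY clause SQL Server does not guarantee any particular order. Note also that a composite clustered key sorts by its first column and uses the second column only to break ties, so if the first column is already unique, the second never changes the order. It is very easy to assume that the rows are physically stored in the order specified within the definition of your clustered index. But this turns out to be a false assumption.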
Please view the following article for more details: Clustered Index do “NOT” guarantee Physically Ordering or Sorting of Rows
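If the results need to come back in a particular order, the dependable way is to ask for it. A small sketch, with a hypothetical table and constraint name standing in for the one in the question:

```sql
-- One composite primary key over two columns (not two primary keys).
CREATE TABLE dbo.People (
    ID  int NOT NULL,
    Age int NOT NULL,
    CONSTRAINT PK_People PRIMARY KEY CLUSTERED (ID, Age)
);

-- The clustered key orders by ID first and uses Age only to break ties.
-- To sort by Age, say so explicitly rather than relying on the index.
SELECT ID, Age
FROM dbo.People
ORDER BY Age, ID;
```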
Imagine Foo table has non-clustered indexes on ColA and ColB
and NO Indexes on ColC, ColD
SELECT colA, colB
FROM Foo
takes about 30 seconds.
SELECT colA, colB, colC, colD
FROM Foo
takes about 2 minutes.
Foo table has more than 5 million rows.
Question:
Is it possible that including columns that are not part of the indexes can slow down the query?
If yes, why? Aren't they part of the pages that have already been read?
If you write a query that uses a covering index, then the full data pages in the heap/clustered index are not accessed.
If you subsequently add more columns to the query, such that the index is no longer covering, then either additional lookups will occur (if the index is still used), or you force a different data access path entirely (such as a table scan instead of an index scan).
Since 2005, SQL Server has supported the concept of Included Columns in an index. This includes non-key columns in the leaf of an index - so they're of no use during the data-lookup phase of index usage, but still help to avoid performing an additional lookup back in the heap/clustered index, if they're sufficient to make the index a covering index.
Also, in future, if you want to get a better understanding on why one query is fast and another is slow, look into generating Execution Plans, which you can then compare.
Even if you don't understand the terms used, you should at least be able to play "spot the difference" between the plans and then search on the terms (such as table scan, index seek, or lookup).
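Alongside the plans, the I/O and timing statistics give a quick numeric comparison. This sketch simply wraps the two queries from the question; the figures it prints will depend entirely on your data and indexes.

```sql
-- Report page reads and CPU/elapsed time for each statement in the Messages output.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Columns that an index can supply on its own.
SELECT colA, colB
FROM Foo;

-- Adds columns that are in no index, so the base table has to be read.
SELECT colA, colB, colC, colD
FROM Foo;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Comparing the "logical reads" reported for each statement usually shows where the extra time goes.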
The simple answer is: a non-clustered index is not stored in the same pages as the data, so SQL Server has to look up the actual data pages to pick up the rest.
Non-clustered indexes are stored in separate data structures, while a clustered index is stored in the same place as the actual data. That's why you can have only one clustered index.
I have a table named Workflow. It has 37M rows in it. There is a primary key on the ID column (int) plus an additional column. The ID column is the first column in the index.
If I execute the following query, the PK is not used (unless I use an index hint)
Select Distinct(SubID) From Workflow Where ID >= @LastSeenWorkflowID
If I execute this query instead, the PK is used
Select Distinct(SubID) From Workflow Where ID >= 786400000
I suspect the problem is with using the parameter value in the query (which I have to do). I really don't want to use an index hint. Is there a workaround for this?
Please post the execution plan(s), as well as the exact table definition, including all indexes.
When you use a variable, the optimizer does not know what selectivity the query will have: @LastSeenWorkflowID may filter out all but the very last few rows in Workflow, or it may include them all. The generated plan has to work in both situations. There is a threshold at which a range seek over the clustered index becomes more expensive than a full scan over a non-clustered index, simply because the clustered index is so much wider (it includes every column in the leaf level) and thus has many more pages to iterate over. The plan generated for an unknown value of @LastSeenWorkflowID likely crosses that threshold when estimating the cost of the clustered index seek, so it chooses the scan over the non-clustered index.
You could provide a narrow index that is aimed specifically at this query:
CREATE INDEX WorkflowSubId ON Workflow(ID, SubId);
or:
CREATE INDEX WorkflowSubId ON Workflow(ID) INCLUDE (SubId);
Such an index is too good to pass up for your query, no matter the value of @LastSeenWorkflowID.
Assuming your PK is an identity OR is always greater than 0, perhaps you could try this:
Select Distinct(SubID)
From Workflow
Where ID >= @LastSeenWorkflowID
And ID > 0
Adding the second condition may cause the optimizer to use an index seek.
This is a classic example of a local variable producing a sub-optimal plan.
You should use OPTION (RECOMPILE) so that the query is compiled with the actual value of the variable that ID is compared against.
See my blog post for more information:
http://www.sqlbadpractices.com/using-local-variables-in-t-sql-queries/
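A minimal sketch of that suggestion, assuming the placeholder in the question is a local variable named @LastSeenWorkflowID of type int, and reusing the literal from the question as a sample value:

```sql
-- Assumed declaration; in the real code the value comes from elsewhere.
DECLARE @LastSeenWorkflowID int;
SET @LastSeenWorkflowID = 786400000;

-- OPTION (RECOMPILE) compiles the statement on every execution, which lets
-- the optimizer take the variable's current value into account.
SELECT DISTINCT SubID
FROM Workflow
WHERE ID >= @LastSeenWorkflowID
OPTION (RECOMPILE);
```

The trade-off is a compilation on every run, which is usually negligible for a query over 37M rows but can add up for statements executed many times per second.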
I am working on optimizing a SQL query that goes against a very wide table in a legacy system. I am not able to narrow the table at this point for various reasons.
My query is running slowly because it does an Index Seek on an Index I've created, and then uses a Bookmark Lookup to find the additional columns it needs that do not exist in the Index. The bookmark lookup takes 42% of the query time (according to the query optimizer).
The table has 38 columns, some of which are nvarchars, so I cannot make a covering index that includes all the columns. I have tried to take advantage of index intersection by creating indexes that cover all the columns, however those "covering" indexes are not picked up by the execution plan and are not used.
Also, since 28 of the 38 columns are pulled out via this query, I'd have 28/38 of the columns in the table stored in these covering indexes, so I'm not sure how much this would help.
Do you think a Bookmark Lookup is as good as it is going to get, or what would another option be?
(I should specify that this is SQL Server 2000)
OH,
the covering index with INCLUDE should work (on SQL Server 2005 and later; SQL Server 2000 does not support INCLUDE). Another option might be to create a clustered indexed view containing only the columns you need.
Regards,
Lieven
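A rough sketch of the indexed view idea follows. Every name in it (dbo.WideTable, its columns, the view and index names) is hypothetical, and it assumes the table has a unique column to cluster the view on and that the session SET options required for indexed views are in place.

```sql
-- GO separates batches in Query Analyzer/SSMS; CREATE VIEW must start its batch.
-- SCHEMABINDING and two-part table names are required for an indexed view.
CREATE VIEW dbo.vWideTableNarrow
WITH SCHEMABINDING
AS
SELECT WideTableID, CustomerID, OrderDate, Amount
FROM dbo.WideTable;
GO

-- The unique clustered index is what materializes the view's rows.
CREATE UNIQUE CLUSTERED INDEX CIX_vWideTableNarrow
    ON dbo.vWideTableNarrow (WideTableID);
GO

-- NOEXPAND asks the optimizer to read the view's own index rather than
-- expanding the view back to the base table (needed on non-Enterprise editions).
SELECT WideTableID, CustomerID, OrderDate, Amount
FROM dbo.vWideTableNarrow WITH (NOEXPAND)
WHERE WideTableID = 1;
```

With 28 of 38 columns needed, the view would duplicate most of the table, so the storage and write overhead should be weighed against the cost of the lookup.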
As another option, you could create an index with included columns.
Here is an example from BOL (this is for 2005 and up):
CREATE NONCLUSTERED INDEX IX_Address_PostalCode
ON Person.Address (PostalCode)
INCLUDE (AddressLine1, AddressLine2, City, StateProvinceID);
To answer this part: "I have tried to take advantage of index intersection by creating indexes that cover all the columns, however those "covering" indexes are not picked up by the execution plan and are not used."
An index can only be used for a seek when the query is written in a way that is sargable. In other words, if you wrap the indexed column in a function in the predicate, or leave the first column of the index out of your WHERE clause, the index won't be used for a seek. If the selectivity of the index is low, the optimizer is also unlikely to use it.
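To illustrate sargability, here are two ways to ask for the same rows. The table dbo.Orders, its columns, and an index on OrderDate are all assumed for the example.

```sql
-- Not sargable: the function hides OrderDate from the index,
-- so every row has to be evaluated.
SELECT OrderID
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2008;

-- Sargable: the bare column is compared against a range,
-- so an index on OrderDate can be used for a seek.
SELECT OrderID
FROM dbo.Orders
WHERE OrderDate >= '20080101'
  AND OrderDate <  '20090101';
```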
Check out SQL Server covering indexes for some more info