SQL makes key lookup instead of seek over indexed column - sql-server

I have 1 large and 2 small tables inner joined. I added appropriate indexes over large table. Even if query is fast (most of the time) some times it is getting over 3 seconds. When I checked execution plan, seems like SQL goes with key lookup instead of index seek.
Here is my query;
and my execution plan;
and here execution details;
Am I missing something here?

A key lookup is a seek. It is looking up using the key.
The non clustered index always includes the clustered index key (or physical rid if the table isn't clustered in which case you get a "bookmark lookup" instead)
Because the index used in the previous index seek does not contain the CreateDate column it needs to use the clustered index key to seek into the clustered index to retrieve it. This type of seek to retrieve additional columns is called a key lookup.
If you wanted to get rid of the lookup you could consider adding CreateDate as an included column to the index on NewsCategoryUrlId.
Though as Hadi says in the comments your case sounds like parameter sniffing or outdated statistics. Often a plan with a non covering index seek and key lookups may be generated if the optimiser believes the parameter value to be selective and be problematic if it is not selective.
With parameter sniffing the problem can arise if a plan is compiled for a selective value and then cached and reused for a less selective value.
Outdated statistics may not reflect the true selectivity of the parameter value the plan is compiled for in the first place.

After suggestion from Martin Smith, I re-create my index as below;
and now, execution plan is mush satisfied for me;

Related

What column choose to create clustered index

I have a table which has over 25 millions rows. The table gets bigger every day (roughly 35 000 rows). I created nonclustered index on 2 columns - date and debt_id (these columns are used most frequently in WHERE clouse), each debt_id occurs only once in each date). So the table is still the heap because it doesn't have a clustered index. Do you think that it would be a good idea to add identity column (1,1) and create clustered index on it? Or what do you think I should do to boost performance on this table?
If your two columns are unique in any case, you can use them as clustered index.
Most important: A clustered index should not change its values, and new rows should be appended in the correct order.
The time of insertion as DATETIME2 as the first column of your clustered index is a good bet here.
The uniqueness must be guaranteed by the combination of this value and the debt_id you've mentioned.
Assuming that neither the time of insertion nor the debt_id are changing data, this looks like a very good combined PK.
Otherwise your clustered index might get fragmented. This would make things even worse... (The main reason why UNIQUEIDENTIFIER IDs tend to be very bad as clustered PK. Regularly running index repair scripts can be an acceptable workaround.)
A non-fragmented clustered index will speed up things, as long as your query filters on both columns (at least the first one must be involved).
You can add more indexes, you might even INCLUDE heavily needed values to them.
Other indexes will use the clustered index as lookup (might need recreation after building the clustered one). This helps if the clustered index is well performing and can make things worse if not.
So I'd say: If the above is true in your case, an additional ID IDENTITY is of little help. This will add one more step to each query, as the Query will need an additional lookup. But, if the index is prone to fragmentation, I'd rather add the additional ID. And finally, to cite George Menoutis in comments
Well, I certainly can't answer this; it is a deep design choice with
loads of pros, loads of cons, and loads of discussion
Without knowing your database and your needs this is pure guessing...

SQL Server 2005 cached an execution plan that could never work

We have a view that is used to lookup a record in a table by clustered index. The view also has a couple of subqueries in the select statement that lookup data in two large tables, also by clustered index.
To hugely simplify it would be something like this:
SELECT a,
(SELECT b FROM tableB where tableB.a=tableA.a) as b
(SELECT c FROM tableC where tableC.a=tableA.a) as c
FROM tableA
Most lookups to [tableB] correctly use a non-clustered index on [tableB] and work very efficiently. However, very occasionally SQL Server, in generating an execution plan, has instead used an index on [tableB] that doesn't contain the value being passed through. So, following the example above, although an index of column [a] exists on tableB, the plan instead does a scan of a clustered index that has column [z]. Using SQL's own language the plan's "predicate is not relevant to the object". I can't see why this would ever be practical. As a result, when SQL does this, it has to scan every record in the index, because it would never exist, taking up to 30 seconds. It just seems plain wrong, always.
Has any one seen this before, where an execution plan does something that looks like it could never be right? I am going to rewrite the query anyway, so my concern is less about the structure of the query, but more as to why SQL would ever get it that wrong.
I know sometimes SQL Server can choose a plan that worked once and it can become inefficient as the dataset changes but in this case it could never work.
Further information
[tableB] has 4 million records, and most values for [a] are null
I'm unable now to get hold of the initial query that generated the plan
These queries are run through Coldfusion but at this time I'm interested in anyone having seen this independently in SQL Server
It just seems plain wrong, always.
You might be interested in the First Rule of Programming.
So, following the example above, although an index of column [a]
exists on tableB, the plan instead does a scan of a clustered index
that has column [z].
A clustered index always includes all rows. It might be ordered by z, but it will still contain all other columns at the leaf level.
The reason SQL Server sometimes prefers a clustered scan over an index seek is this. When you do an index seek, you have to follow it up with a bookmark lookup to the clustered index to retrieve columns that are not in the index.
When you do a clustered index scan, you by definition find all columns. That means no bookmark lookup is required.
When SQL Server expects many rows, it tries to avoid the bookmark lookups. This is a time-tested choice. Nonclustered index seeks are routinely beaten by clustered index scans.
You can test this for your case by forcing either with the with (index(IX_YourIndex)) query hint.

Why isn't a particular index being used in a query?

I have a table named Workflow. It has 37M rows in it. There is a primary key on the ID column (int) plus an additional column. The ID column is the first column in the index.
If I execute the following query, the PK is not used (unless I use an index hint)
Select Distinct(SubID) From Workflow Where ID >= #LastSeenWorkflowID
If I execute this query instead, the PK is used
Select Distinct(SubID) From Workflow Where ID >= 786400000
I suspect the problem is with using the parameter value in the query (which I have to do). I really don't want to use an index hint. Is there a workaround for this?
Please post the execution plan(s), as well as the exact table definition, including all indexes.
When you use a variable the optimizer does no know what selectivity the query will have, the #LastSeenWorkflowID may filter out all but very last few rows in Workflow, or it may include them all. The generated plan has to work in both situations. There is a threshold at which the range seek over the clustered index is becoming more expensive than a full scan over a non-clustered index, simply because the clustered index is so much wider (it includes every column in the leaf levels) and thus has so much more pages to iterate over. The plan generated, which considers an unknown value for #LastSeenWorkflowID, is likely crossing that threshold in estimating the cost of the clustered index seek and as such it chooses the scan over the non-clustered index.
You could provide a narrow index that is aimed specifically at this query:
CREATE INDEX WorkflowSubId ON Workflow(ID, SubId);
or:
CREATE INDEX WorkflowSubId ON Workflow(ID) INCLUDE (SubId);
Such an index is too-good-to-pass for your query, no matter the value of #LastSeenWorkflowID.
Assuming your PK is an identity OR is always greater than 0, perhaps you could try this:
Select Distinct(SubID)
From Workflow
Where ID >= #LastSeenWorkflowID
And ID > 0
By adding the 2nd condition, it may cause the optimizer to use an index seek.
This is a classic example of local variable producing a sub-optimal plan.
You should use OPTION (RECOMPILE) in order to compile your query with the actual parameter value of ID.
See my blog post for more information:
http://www.sqlbadpractices.com/using-local-variables-in-t-sql-queries/

Is a SqlProfiler Scan Started bad?

If in SqlProfiler you can see that to execute a query a Scan is Started, does this mean a full table scan or can it just be a lookup? If it can be both, how can you tell which one of the two it is?
From the documentation:
The Scan:Started event class occurs when a table or index scan is started.
So it could be either one. The IndexID field will tell you if it is an index, and which one.
Not that it really matters very much. A clustered index scan basically is a table scan. A nonclustered index scan is better, but only a little. If you see any full scan, it means either (a) you're using non-sargable predicates or predicates on fields that aren't indexed, or (b) the predicate fields are indexed but the output columns aren't covered by the index, and the optimizer has decided that it is cheaper to perform a full scan than a bookmark/RID lookup.
Index scans aren't often much better than table scans, performance-wise, so you should try to eliminate whatever is leading to it, if possible.

SQL Server: How does the type of an index affect a join's performance?

If I'm am trying to squeeze every last drop of performance out of a query what affect does having these types of index's being used by my joins.
clustered index.
non-clustered index.
clustered or non-clustered index with extra columns that may not be involved in the join.
Will I gain any performance if I go through and create clustered index's that only contain the columns involved in my joins and nothing else?
(I realize I may have to move the clustered index from another index(making that index non-clustered) since it can only have one.)
In addition to Gareth Saul's answer a tiny clarification:
Non-clustered indexes repeat the
included fields, with pointer to the
rows that have that value.
This pointer to the actual data value is the column (or the set of columns) that are in your clustering key.
That's one of the main reasons why you should try and keep the clustering key small and static - small because otherwise you'll waste a lot of space, on disk and in your server's RAM, and static because otherwise, you'll have to update not just your clustering index, but also all your non-clustered indices as well, if your value changes.
This "lookup pointer is the clustering key" feature has been in SQL Server since version 7, as Kim Tripp will explain in great detail here:
What is a clustered index?
In SQL Server 7.0 and higher the
internal dependencies on the
clustering key CHANGED. (Yes, it's
important to know that things CHANGED
in 7.0... why? Because there are still
some folks out there that don't
realize how RADICAL of a change
occurred in the internals (wrt to the
clustering key) in SQL Server 7.0).
What changed is that the clustering
key gets used as the "lookup" value
from the nonclustered indexes.
Will I gain any performance if I go through and create clustered index's that only contain the columns involved in my joins and nothing else?
Not as I understand. The point of a clustered index is that it then sorts the data on disk around that index (hence why you can only have the one), so if your join data isn't being sorted by those exact columns as well, I don't think it'd make any difference. Plus by putting data that might change (as opposed to the key) into the clustered index, you make it more likely that things will need rebuilding peridically, slowing the overall database down.
Sorry if this sounds a daft question, but have you tried running your query through the index tuning wizard? Not foolproof by any stretch but I've had some decent improvements from it in the past.
You only get one clustered index - this is what controls the physical storage of the table on disk / in memory.
Non-clustered indexes repeat the included fields, with pointer to the rows that have that value. Having an index on the columns being used in your joins should improve performance. You can further optimise by using "included columns" in your index - this duplicates the row information directly into the index, which can remove the performance penalty of having to look up the row itself to perform the select.
It is useful to pay attention to the order in which your joins occur - the sequence of columns in your index should match up to this. Remember that the SQL engine may optimise and re-order your query internally - profiling may be helpful.
In most situations, you can just use the Database Engine Tuning Advisor - the recommendations it provides are pretty much spot on.
If you can your best bet is for a non-clustered index that has all the element of your join in it and if possible the field you are selecting.
This will create a spanning index meaning that all the fields SQL requires to perform are on one index.
If possible have an index which has no unnessasery field in it. Every field added makes the an individual index record larger, the smaller each index record the more you get in each Page. The more index items you get in each page the less you have to go to the Disk.
Clustered Index - Will mean the table is layed out in the order specified in the Index, this means that you will get better performance for select * from TABLE where INDEXFIELD = 3. Unless you are selecting lots of large data items this should not be required.

Resources