The following query performs badly because of a full non-clustered index scan of 6.5 million records in P4FileReleases followed by a hash join. I'm looking for possible reasons the optimizer picks a scan over a seek.
SELECT p4f.FileReleaseID
FROM P4FileReleases p4f
INNER JOIN AnalyzedFileView af
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
WHERE (af.tracked_change_id = 1)
From what I can tell, there is no reason for the optimizer to pick a scan of P4FileReleases. The WHERE clause limits the right-hand dataset to about 1K records, and the optimizer should know it (see the histogram below).
In fact, if I copy the view data into a heap table (same structure as the indexed view), the query runs with an index seek on the larger table and a nested loops inner join instead of a hash join, and the total cost drops from 145 to around 1.
Any ideas on what might be throwing the optimizer off?
Details: SQL Server 2008 (v. 10.0.2757.0).
P4FileReleases table
Holds 6.5 million records
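CREATE TABLE [dbo].[P4FileReleases](
    [FileReleaseID] [int] IDENTITY(1,1) NOT NULL,
    [FileRelease] [varchar](254) NOT NULL,
    -- 5 more fields
    PRIMARY KEY CLUSTERED ([FileReleaseID] ASC)
)
CREATE NONCLUSTERED INDEX [NCIX_P4FileReleases_FileRelease]
ON [dbo].[P4FileReleases] ([FileRelease] ASC)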
AnalyzedFileView is an indexed view with statistics enabled and up to date.
It has four columns:
key int (int, PK) - clustered index
tracked_change_id (int, FK) - non-unique, non-clustered index (covering 'path', 'revision')
path (nvarchar(1024), null)
revision (smallint, null)
tracked_change_id histogram:
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1 0 1222 0 1
4 0 787 0 1
8 0 2754 0 1
12 0 254 0 1
13 0 34 0 1
Query Plan
|--Parallelism(Gather Streams)
|--Hash Match(Inner Join, HASH:([Expr1011])=([Expr1010]), RESIDUAL:([Expr1010]=[Expr1011]))
|--Bitmap(HASH:([Expr1011]), DEFINE:([Bitmap1015]))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
| |--Compute Scalar(DEFINE:([Expr1011]=([qpsitools].[dbo].[analyzed_file_view].[path]+N'#')+CONVERT_IMPLICIT(nvarchar(30),CONVERT(varchar(30),[qpsitools].[dbo].[analyzed_file_view].[revision],0),0)))
| |--Index Seek(OBJECT:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]), SEEK:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]=(1)) ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1010]), WHERE:(PROBE([Bitmap1015],[Expr1010])))
|--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(nvarchar(254),[Blueprint].[dbo].[P4FileReleases].[FileRelease] as [p4f].[FileRelease],0)))
|--Index Scan(OBJECT:([Blueprint].[dbo].[P4FileReleases].[NCIX_P4FileReleases_FileRelease] AS [p4f]))

You are joining the varchar column p4f.FileRelease with an expression built on an nvarchar column (af.path). Since the data types don't match, SQL Server has to convert one type to the other, and data type precedence means the varchar side is converted to nvarchar (it can't safely go from nvarchar to varchar). In converting p4f.FileRelease to nvarchar, it loses the ability to use the index to look up those values, so it has to scan and convert every row. You can see this in the plan as CONVERT_IMPLICIT(nvarchar(254), [p4f].[FileRelease]) just above the index scan.
The best solution is to store the data as matching data types (change column p4f.FileRelease to nvarchar, or af.path to varchar). Since no one ever gets to modify existing database structures, a workaround might be to explicitly cast af.path to varchar in the query. Test it and see, though of course you can't do this if the data truly requires double-byte (Unicode) characters.
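As a rough sketch of the workaround described above, the join expression can be built entirely from varchar values so that it matches the type of p4f.FileRelease. The cast lengths (varchar(1024) and varchar(6)) are assumptions taken from the column definitions in the question. Any characters in af.path that cannot be represented in the database's varchar code page would be replaced, so this is only safe if the paths are plain single-byte text.

```sql
-- Sketch only: cast the nvarchar side down to varchar so that no implicit
-- conversion is applied to the indexed column p4f.FileRelease.
SELECT p4f.FileReleaseID
FROM P4FileReleases AS p4f
INNER JOIN AnalyzedFileView AS af
    ON p4f.FileRelease = CAST(af.path AS varchar(1024)) + '#' + CAST(af.revision AS varchar(6))
WHERE af.tracked_change_id = 1;
```

Whether the optimizer then picks a seek still depends on its estimates, so compare the plan before and after the change.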

Your problem is not the WHERE but the JOIN: you are getting an implicit conversion and a scan on the JOIN, while on the WHERE condition you are getting a seek.
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
Parallelism could also be a problem; try adding OPTION (MAXDOP 1) to the query.
Are your statistics up to date? Is there excessive fragmentation?

Try moving "af.tracked_change_id = 1" into the join clause.
INNER JOIN AnalyzedFileView af
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
AND af.tracked_change_id = 1
WHERE is applied after the INNER JOIN

Philip Kelley spotted the problem. It was a datatype mismatch between varchar in P4FileReleases and nvarchar in AnalyzedFileView.


Performance Strategy for growing data

I know that performance tuning needs to be done specifically for each environment, but I have tried to make my question as clear as possible to see whether I am missing any possible improvements.
I have a table [TestExecutions] in SQL Server 2005. It has around 0.2 million records as of today and is expected to grow to 5 million in a couple of months.
CREATE TABLE [dbo].[TestExecutions]
(
[TestExecutionID] [int] IDENTITY(1,1) NOT NULL,
[OrderID] [int] NOT NULL,
[LineItemID] [int] NOT NULL,
[Manifest] [char](7) NOT NULL,
[RowCompanyCD] [char](4) NOT NULL,
[RowReferenceID] [int] NOT NULL,
[RowReferenceValue] [char](3) NOT NULL,
[ExecutedTime] [datetime] NOT NULL
)
CREATE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
I have the following two queries for the same purpose (Query 2 and Query 3). With 100 records in #OrdersForRC, Query 2 works better (39% vs 47%), whereas with 10000 records in #OrdersForRC, Query 3 works better (53% vs 33%), according to the execution plans.
In the initial few months of use, the #OrdersForRC table will have close to 100 records. It will gradually increase to 2500 records over a couple of months.
Of the two approaches below, which one is better for such an incrementally growing scenario? Or is there any strategy to make one approach work well even as the data grows?
Note: In Plan2, the first Query uses Hash Match
query optimizer operator choice - nested loops vs hash match (or merge)
Execution Plan Basics — Hash Match Confusion
Test Query
CREATE TABLE #OrdersForRC (OrderID INT)
INSERT INTO #OrdersForRC
--SELECT DISTINCT TOP 100 OrderID FROM [TestExecutions]
SELECT DISTINCT TOP 5000 OrderID FROM LWManifestReceiptExecutions
--QUERY 2:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
INNER JOIN #OrdersForRC R
ON R.OrderID = H.OrderID
--QUERY 3:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
WHERE H.OrderID IN (SELECT OrderID FROM #OrdersForRC)
Plan 1
Plan 2
As commented above, you have not given the definition of the LWManifestReceiptExecutions table or how many rows it holds.
You are selecting the TOP N rows without an ORDER BY. Do you want an arbitrary N IDs, or IDs in a specific order, or does the order not matter to you?
If the order does matter, you can create an index on the column you need in the ORDER BY.
If OrderID is unique in the [dbo].[TestExecutions] table, you should mark it as unique: drop the index and recreate it as UNIQUE.
DROP INDEX [IX_TestExecutions_OrderID] ON [dbo].[TestExecutions]
CREATE UNIQUE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
You mentioned that the data keeps growing and will reach millions of rows in a couple of months.
No need to worry: SQL Server can easily handle these queries with a properly built schema and indexes.
When this data model starts hurting, you can look at other options, but not now. I have seen people handling billions of rows in SQL Server.
I can see that you are comparing the queries on the basis of query cost and concluding that the query with the higher percentage is the more expensive one.
That is not always the case. Query cost is the aggregate subtree cost of all iterators in the query plan, and the total estimated cost of an iterator is a simple sum of its I/O and CPU components.
The cost values represent expected execution times (in seconds) on a particular hardware configuration, but with modern hardware these costs might be irrelevant.
Now, coming to your queries:
You have written two queries to get the result, but they are not identical.
In Plan 1, Query 1
Expressed by JOIN
The query optimizer (QO) chooses a nested loop join, which is a good choice for this particular scenario.
For every OrderID row in #OrdersForRC, it seeks the matching value in dbo.[TestExecutions]
until all rows are matched.
In Plan 1, Query 2
Expressed by IN
The QO does the same thing as in Query 1, but there is an extra Distinct Sort (Sort and Stream Aggregate).
The reason is that you expressed this query with IN, and #OrdersForRC can contain duplicate rows,
so eliminating them is necessary.
In Plan 2, Query 1
Expressed by JOIN
Now that #OrdersForRC has 1000 rows, the QO chooses a hash join over a loop join,
because a loop join for 1000 rows costs more than a hash join, and the rows are unordered
and can contain NULLs as well, so a hash join is a good strategy here.
In Plan 2, Query 2
Expressed by IN
The QO has chosen a Distinct Sort for the same reason as in Plan 1, Query 2, and then a Merge Join,
because the rows are now sorted on the ID column for both tables.
If you declare the temp table column as NOT NULL and UNIQUE, it is more likely that you will get the same execution plan for both the IN and the JOIN.
(OrderID INT not null Unique)
Execution plan
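The suggestion about the temp table can be tried out with something like the following sketch. It assumes the OrderID values loaded into #OrdersForRC really are unique and never NULL, and it fills the table from dbo.TestExecutions purely for illustration.

```sql
-- Assumption: the OrderID values placed in the temp table are unique and not NULL.
CREATE TABLE #OrdersForRC (OrderID INT NOT NULL UNIQUE);

INSERT INTO #OrdersForRC (OrderID)
SELECT DISTINCT TOP (100) OrderID
FROM dbo.TestExecutions
ORDER BY OrderID;

-- JOIN form
SELECT H.OrderID, H.LineItemID, H.Manifest, H.RowCompanyCD, H.RowReferenceID
FROM dbo.TestExecutions AS H
INNER JOIN #OrdersForRC AS R
    ON R.OrderID = H.OrderID;

-- IN form
SELECT H.OrderID, H.LineItemID, H.Manifest, H.RowCompanyCD, H.RowReferenceID
FROM dbo.TestExecutions AS H
WHERE H.OrderID IN (SELECT R.OrderID FROM #OrdersForRC AS R);

DROP TABLE #OrdersForRC;
```

With the uniqueness guarantee in place, the optimizer no longer needs a distinct step for the IN form, so comparing the two actual plans at 100 and at a few thousand rows should show whether they converge.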

Why is this query running so slow?

This query runs very fast (<100 msec):
SELECT TOP (10) [Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10) [Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer does not seem to use it. I tried forcing it with WITH (INDEX(Ix_Sms_Time)), but to my surprise it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and stats are updated). I am using MS SQL Server 2008 R2 on a 16-core, 32 GB Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run?
For instance, like in this example:
;WITH ClientData AS (
    SELECT [E2].[CompanyId], [E1].[Id], [E1].[Status], [E2].[Time]
    FROM [dbo].[SplittedSms] AS [E1]
    INNER JOIN [dbo].[Sms] AS [E2]
    ON [E1].[SmsId] = [E2].[Id]
    WHERE [E2].[CompanyId] = 4563
    AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT [CompanyId], [Id], [Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
You may or may not need to add Id as an Included Column.
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into the equivalent data type, so that the index can be used.
DECLARE @test datetime
SET @test = '2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to a matching data type if there is a data type mismatch, and a function or cast applied to the column prevents index use.
I'm on the same track as @JonTirjan: an index on just Time results in a lot of key lookups, so you should try at least one of the following:
create index xxx on Sms (Time, CompanyId) include (Id)
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index key, then it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyId 4563, it might be OK to have CompanyId as an included column too.
The percentages you see in actual plan are just estimates based on the row count assumptions, so those are sometimes totally wrong. Looking at actual number of rows / executions + statistics IO output should give you idea what's actually happening.
Two things come to mind:
By adding an extra restriction it becomes 'harder' for the database to find the first 10 items that match. Finding the first 10 rows out of, say, 10,000 matching items (from a total of 1 million) is easier than finding the first 10 rows out of maybe 100 matching items (from a total of 1 million).
The index is probably not being used because it is created on a datetime column, which is not very efficient if you are also storing the time part in it. You might want to create a clustered index on the [Time] column (but then you would have to remove the clustered index which is now on the [CompanyId] column), or you could create a computed column which stores the date part of the [Time] column, create an index on this computed column, and filter on it.
I found out that there was no index on the foreign key column (SmsId) on the SplittedSms table. I made one and it seems the second query is almost as fast as the first one now.
The execution plan now:
Thanks everyone for the effort.
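For reference, an index like the one described in the accepted fix could be created roughly as follows. The index name is hypothetical, and whether to add included columns depends on the rest of the workload.

```sql
-- Hypothetical index name; supports the join SplittedSms.SmsId = Sms.Id.
CREATE NONCLUSTERED INDEX IX_SplittedSms_SmsId
ON dbo.SplittedSms (SmsId);
```

With this index in place, the optimizer can filter Sms first and then seek into SplittedSms per matching Sms row instead of scanning it.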

Need some assistance understanding a SQL Server 2012 query plan

I have the following query:
Select TOP 5000
From dbo.PanelCdCl CdCl WITH (NOLOCK)
Inner Join dbo.PanelHistory PH ON PH.SubId = CdCl.SubId
Where CdCl.PanelCdClStatusId IS NULL And PH.LastProcessNumber >= 1605
Order By CdCl.SubId
The query plan looks as follows:
Both the PanelCdCl and PanelHistory tables have a clustered index / primary key on SubId, and it's the only column in the index. There is exactly one row for each SubId in each table. Both tables have ~35M total rows in them.
I'm curious why the query plan is showing a clustered index scan on PanelHistory when the join is being done on the clustered index column.
It's not scanning PanelHistory's clustered index(SubId) to find a SubId, it's scanning on it to find all rows where LastProcessNumber >= 1605. This is the first logical step.
Then it likewise scans PanelCdCl to find all rows where PanelCdClStatusId is NULL. Since both tables are indexed on SubId, both inputs are already sorted on the join column, so it can do a Merge Join without an additional sort. (A Merge Join is almost always the most efficient choice if it doesn't have to re-sort the input rows.)
Then it doesn't have to do a Sort for the ORDER BY, because it's already in SubId order.
And finally, it does the TOP, which has to be after everything else (by the rules of SQL clause logical execution ordering).
So the only place it tests SubId values is in the Merge-Join, it never pushes it down to the scans. This would probably remain true if it did a Hash-Join instead. Only for a Nested-Loop Join would it have to push the SubId test down as a seek on a table, and that should only be the lower branch, not the upper one.
The merge join operator needs two sorted inputs. The clustered key is SubId in both tables, which means that the scan of PanelHistory returns the rows in the correct order. The clustered key is included in all nonclustered indexes, so the rows in the nonclustered index IX_PanelCdCl_PanelCdClStatusId where PanelCdClStatusId is NULL are ordered by SubId as well, and they can also be fed directly to the merge join.
What you see here is actually two scans: one of the clustered index on PanelHistory with a residual predicate on LastProcessNumber >= 1605, and one index range scan of IX_PanelCdCl_PanelCdClStatusId for as long as PanelCdClStatusId is NULL.
They will, however, not scan the entire table/index. The query is executed from left to right in the query plan, with the SELECT asking for one row at a time until there are no more rows to be had. That means the Top operator stops asking the merge join for new rows once it has the required 5000 rows.
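One way to check that the scans stop early is to compare the logical reads against the size of the tables. The sketch below assumes a select list of just CdCl.SubId, since the original column list is not shown.

```sql
SET STATISTICS IO ON;

SELECT TOP (5000) CdCl.SubId   -- assumed select list
FROM dbo.PanelCdCl AS CdCl WITH (NOLOCK)
INNER JOIN dbo.PanelHistory AS PH
    ON PH.SubId = CdCl.SubId
WHERE CdCl.PanelCdClStatusId IS NULL
  AND PH.LastProcessNumber >= 1605
ORDER BY CdCl.SubId;

SET STATISTICS IO OFF;
```

If the logical reads reported for PanelHistory in the Messages tab are well below the table's total page count, the scan ended before reaching the end of the index.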

Sequential Guid and fragmentation

I'm trying to understand how a sequential GUID performs better than a regular GUID.
Is it because, with a regular GUID, the index uses the last byte of the GUID to sort? Since it's random, it will cause a lot of fragmentation and page splits, because data often has to be moved to another page to make room for new rows?
And a sequential GUID, since it is sequential, will cause far fewer page splits and much less fragmentation?
Is my understanding correct?
If anyone can shed more light on the subject, I'd appreciate it very much.
Thank you
Sequential guid = NEWSEQUENTIALID(),
Regular guid = NEWID()
You've pretty much said it all in your question.
With a sequential GUID / primary key, new rows will be added together at the end of the table, which makes things nice and easy for SQL Server. In comparison, a random primary key means that new records could be inserted anywhere in the table. The chance of the last page of the table being in the cache is fairly high (if that's where all of the reads are going), but the chance of a random page in the middle of the table being in the cache is fairly low, meaning additional IO is required.
On top of that, when inserting rows into the middle of the table there is the chance that there isn't enough room to insert the extra row. If this is the case then SQL server needs to perform additional expensive IO operations in order to create room for the record - the only way to avoid this is to have gaps scattered amongst the data to allow for extra records to be inserted (known as a Fill factor), which in itself causes performance issues because the data is spread over more pages and so more IO is required to access the entire table.
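To make the fill factor idea concrete, here is a minimal sketch of rebuilding a clustered primary key with free space left on each leaf page. The table and index names (dbo.YourTable, PK_YourTable) and the value 80 are placeholders, not recommendations.

```sql
-- Leave roughly 20% free space on each leaf page at rebuild time, so that
-- rows with random GUID keys can be inserted without an immediate page split.
ALTER INDEX PK_YourTable ON dbo.YourTable
REBUILD WITH (FILLFACTOR = 80);
```

The free space is only applied when the index is built or rebuilt, so it gets used up as rows arrive and the index has to be rebuilt periodically to restore it.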
I defer to Kimberly L. Tripp's wisdom on this topic:
But, a GUID that is not sequential - like one that has it's values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63-1 rows.
To visualize the whole picture, a utility named ostress can be used.
E.g. you can create two tables: one with a normal GUID as PK, another with a sequential GUID:
-- normal one
CREATE TABLE dbo.YourTable(
[id] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_YourTable] PRIMARY KEY CLUSTERED ([id] ASC)
)
-- sequential one
CREATE TABLE dbo.YourTableSeq(
[id] [uniqueidentifier] NOT NULL CONSTRAINT [df_yourtable_id] DEFAULT (newsequentialid()),
CONSTRAINT [PK_YourTableSeq] PRIMARY KEY CLUSTERED ([id] ASC)
)
Then, with this utility, you run a number of inserts and collect statistics about index fragmentation:
ostress -Slocalhost -E -dYourDB -Q"INSERT INTO dbo.YourTable VALUES (NEWID()); SELECT count(*) AS Cnt FROM dbo.YourTable; SELECT AVG_FRAGMENTATION_IN_PERCENT AS AvgPageFragmentation, PAGE_COUNT AS PageCounts FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL , NULL, N'LIMITED') DPS INNER JOIN sysindexes SI ON DPS.OBJECT_ID = SI.ID AND DPS.INDEX_ID = SI.INDID WHERE SI.NAME = 'PK_YourTable';" -oE:\incoming\TMP\ -n1 -r10000
ostress -Slocalhost -E -dYourDB -Q"INSERT INTO dbo.YourTableSeq DEFAULT VALUES; SELECT count(*) AS Cnt FROM dbo.YourTableSeq; SELECT AVG_FRAGMENTATION_IN_PERCENT AS AvgPageFragmentation, PAGE_COUNT AS PageCounts FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL , NULL, N'LIMITED') DPS INNER JOIN sysindexes SI ON DPS.OBJECT_ID = SI.ID AND DPS.INDEX_ID = SI.INDID WHERE SI.NAME = 'PK_YourTableSeq';" -oE:\incoming\TMP\ -n1 -r10000
Then in file E:\incoming\TMP\query.out you will find your statistics.
My results are:
"Normal" GUID:
Records AvgPageFragmentation PageCounts
1000 87.5 8
2000 93.75 16
3000 96.15384615384616 26
4000 96.875 32
5000 96.969696969696969 33
10000 98.571428571428584 70
Sequential GUID:
Records AvgPageFragmentation PageCounts
1000 83.333333333333343 6
2000 63.636363636363633 11
3000 41.17647058823529 17
4000 31.818181818181817 22
5000 25.0 28
10000 12.727272727272727 55
As you can see, when sequentially generated GUIDs are inserted, the index is much less fragmented, because the inserts need new page allocations less often.

Why does SQL choose an incorrect index in my case?

I have a table with two indices; one is a multi-column clustered index on 3 columns:
symbolid int16,
bartime int32,
typeid int8
The second is non clustered on
bartime int16
The select statement i'm trying to run is:
SELECT symbolID, vTrdBuy
FROM mvTrdHidUhd
WHERE typeID = 1
AND barDateTime = 44991
AND symbolid in (1010,1020,1030,1040,1050,1060)
I ran this query on SQL Server 2008 in the Management Studio editor with the actual execution plan enabled. I found that SQL Server uses the second index and proposes creating a new index on the three columns (symbolid, bartime, typeid), but nonclustered! (I think it says nonclustered because there is already a clustered one.)
This choice is wrong: when I reran the same query and forced SQL Server to use the clustered index (using WITH (INDEX ...)), performance was better, as it should be.
I have two questions here, one about this behavior and the second about the query itself:
Why does SQL Server choose the wrong index and propose an index on the same columns?
Which form should I use in the WHERE condition for better performance?
symbolid in (1010,1020,1030,1040,1050,1060)
(symbolid = 1010 or symbolid = 1020 ..etc)
(symbolid between (1010 and 1060))
After Testing
I found that when I change the WHERE condition from IN to >= and <=, the nonclustered index on the bartime column gives better performance than the clustered index on 3 columns.
So I have two cases: if the WHERE uses IN, it is better to use the clustered index; if it contains >= and <=, it is better to use the second one.
SELECT symbolID, vTrdBuy
FROM mvTrdHidUhd
WHERE typeID = 1
AND barDateTime = 44991
AND symbolid IN (1010,1020,1030,1040,1050,1060)
This condition is not covered by a single contiguous range of your clustered index.
These rows:
1010, 44991, 1
1010, 50000, 1
1020, 44991, 1
will come in order in the index, but your query will select the first and the third one, skipping the second.
SQL Server can use Clustered Index Seek if there is a limited number of predicates, like in your IN case. In this case it uses a number of ranges:
SELECT symbolID, vTrdBuy
FROM mvTrdHidUhd
WHERE (typeID = 1
AND barDateTime = 44991
AND symbolid = 1010)
OR
(typeID = 1
AND barDateTime = 44991
AND symbolid = 1020)
OR …
But in case of a BETWEEN range on symbolid it cannot construct such a limited number of predicates, that's why it reverts to less efficient Clustered Index Scan (which scans on symbolid and just filters the wrong results out).
In this case your nonclustered index performs better.
You could rewrite your query like this:
SELECT m.symbolID, m.vTrdBuy
FROM (
SELECT DISTINCT symbolid
FROM mvTrdHidUhd
WHERE symbolid BETWEEN 1010 AND 1050
) s
JOIN mvTrdHidUhd m
ON m.symbolid = s.symbolid
AND m.typeID = 1
AND m.barDateTime = 44991
This will use a Clustered Index Seek on your table as well, both to build the list of DISTINCT symbolid values and to join on this list.
Updating the statistics on the table / indexes may make it choose the correct index.
Use symbolid BETWEEN 1010 AND 1050 if possible. Using BETWEEN, =, >=, >, < or <=, or a combination of these with AND, generally leads to better performance and better index selection than using OR or IN.
It is possible that the order of the index columns affects whether the optimizer will choose your index. You indicate the index is (symbolid int16, bartime int32, typeid int8), but symbolid is the least selective predicate in your WHERE clause. This would require 6 index lookups for the 6 values you have.
I would probably start with the BETWEEN form, but only testing with your data, server, indexes, etc. will prove the best case.
If you are going to create another index, try the 2 other orders for those columns.
And, as noted elsewhere, update your statistics.
You can also try out a covering index on (symbolid, bartime, typeid, vTrdBuy).
Your query references four columns (symbolID, vTrdBuy, typeID and barDateTime), while the clustered index only covers three of them.
The reason SQL Server ignores that index is that it's useless to it. The index is first sorted by symbolID, and you don't want a specific symbolID, but a bunch of random values. This means that it has to read all over the table.
The next column in the clustered index is vTrdBuy. This isn't used to help it to skip to the rows it actually wants.
Looking at the query, two columns are very specific in limiting what rows you want to return:
WHERE typeID = 1
AND barDateTime = 44991
Creating an index that starts with typeID and barDateTime can really be useful in helping SQL Server jump to the rows that you are interested in.
First SQL Server can jump right to the rows that are
typeID = 1.
Once there, it can jump right to bars where
barDateTime = March 8, 2023
It can do this by seeking right through the index, since the index is ordered by the columns in it. This is very fast, and it's eliminated the majority of rows from being looked at.
If you were to create the index:
it still might not be useful if the query returns a lot of rows. To finish the SELECT statement, SQL Server still needs the vTrdBuy value. It has to get it by jumping into the table for each of the rows that match the criteria (called a Bookmark Lookup). If there are too many rows (say > 500), SQL Server will just forget the index and scan the entire table, because that would be faster.
To prevent the bookmark lookup, so that it does not have to go back to the table for the missing value, include the value in the index:
CREATE INDEX IX_mvTrdHidUhd_FancyCovering ON mvTrdHidUhd
(typeID, barDateTime, symbolID, vTrdBuy)
Now you have an index that contains everything SQL Server wants, in the order that it wants, and you don't have to mess with the physical sort order (i.e. clustering) of the table.
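A variant worth testing, sketched under the assumption that vTrdBuy is only returned and never filtered on, is to keep the three filter columns as index keys and carry vTrdBuy as an included column. The index name is hypothetical.

```sql
-- Hypothetical index name. The key columns drive the seek;
-- vTrdBuy is stored only at the leaf level to cover the SELECT list.
CREATE NONCLUSTERED INDEX IX_mvTrdHidUhd_Covering_Include
ON mvTrdHidUhd (typeID, barDateTime, symbolID)
INCLUDE (vTrdBuy);
```

Included columns are not part of the index key, which keeps the key narrower while still avoiding the bookmark lookup.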
