Using Index scan instead of seek with lookup


I have a table with the following structure:
CREATE TABLE Articles
(
id UNIQUEIDENTIFIER PRIMARY KEY,
title VARCHAR(60),
content VARCHAR(2000),
datePosted DATE,
srcImg VARCHAR(255),
location VARCHAR(255)
);
I then put a non clustered index on location:
CREATE NONCLUSTERED INDEX Articles_location
ON Articles (location);
Running a query like this one:
select a.content
from Articles a
where a.location = 'Japan, Tokyo';
results in an: "Index Scan (Clustered)"
Running another query like this :
select a.location
from Articles a
where a.location = 'Japan, Tokyo';
results in an: "Index Seek (NonClustered)"
So the nonclustered index is working. Why does the first query do a scan, rather than a seek with a lookup, when I select a column that is not in the index?
The table has 200 rows in total.
This query retrieves 86 of them.

It looks like the query optimizer decides to scan the table instead of using the index because of the selectivity of the predicate.
It may actually be faster to read the table directly than to seek via the index and then perform a Key Lookup for every matching row. Here 86 of 200 rows is 43% of the table. This may not be the case if the table has more rows (> 10k) and the predicate matches a smaller fraction of them.
select a.content from Articles a where a.location = 'Japan, Tokyo';
-- clustered index scan
select a.location from Articles a where a.location = 'Japan, Tokyo';
-- covering index
Scans vs. Seeks
Thus, a seek is generally a more efficient strategy if we have a highly selective seek predicate; that is, if we have a seek predicate that eliminates a large fraction of the table.
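One way to see this trade-off on the Articles table from the question is to compare the optimizer's choice with a forced seek, and then make the index covering so that no lookup is needed. This is only a sketch: the index name Articles_location_content is made up for illustration, and on a 200-row table the differences in logical reads will be small.

```sql
-- Report logical reads per statement in the Messages output.
SET STATISTICS IO ON;

-- 1. Optimizer's own choice: a clustered index scan, per the question.
SELECT a.content
FROM Articles a
WHERE a.location = 'Japan, Tokyo';

-- 2. Force a seek. Each matching row then needs a Key Lookup into the
--    clustered index to fetch content, which the index does not contain.
SELECT a.content
FROM Articles a WITH (FORCESEEK)
WHERE a.location = 'Japan, Tokyo';

-- 3. Hypothetical covering index: content is stored at the leaf level,
--    so the query can be answered from the index alone.
CREATE NONCLUSTERED INDEX Articles_location_content
ON Articles (location)
INCLUDE (content);

SELECT a.content
FROM Articles a
WHERE a.location = 'Japan, Tokyo';

SET STATISTICS IO OFF;
```

Comparing the logical reads reported for statements 1 and 2 shows why the optimizer preferred the scan at this selectivity.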

Related

Non clustered indexes for special case

TABLE SellerTransactions
string SellerId,
string ProductId,
DateTime CreateDate,
string BankNumber,
string Name (name+' '+surname+' '+alias),
string Comments,
decimal Amount
etc...
What would be the best approach for searching/filtering with a nonclustered index when we search by sellerID, ProductIds, CreateDate and sometimes Amount/BankNumber? Should the nonclustered index be only on the (sellerID, ProductIds, CreateDate) columns, or on all columns where a search might happen (a single big nonclustered index)?
The query will always contain (sellerID, ProductIds, CreateDate) and sometimes additionally BankNumber/Amount.
Say 90% of the time the search is on sellerID, ProductIds, CreateDate, and 10% of the time on sellerID, ProductIds, CreateDate plus Amount or BankNumber.
I was thinking of having a nonclustered index on (sellerID, ProductIds, CreateDate) and separate ones for Amount and BankNumber.

I think a filtered index could improve the performance of your query.
What is a filtered index?
A filtered index indexes only a portion of the rows in a table.
That is, the index is defined with a filter predicate, so it is smaller than a full-table index, which can improve the performance of queries that select from that subset.
For more info, see: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/create-filtered-indexes?view=sql-server-2017
Syntax
CREATE NONCLUSTERED INDEX Non_ClustredIndexName
ON Table(ColumnName)
WHERE ColumnName = 'ColumnValue' -- the filter must compare against a constant, not a variable
Example as per your table:
1.
CREATE NONCLUSTERED INDEX FI_Employee_DOJ
ON tbl_SellerTransactions(ST_Name)
WHERE ST_Name IS NOT NULL
2.
CREATE NONCLUSTERED INDEX NonCluster_sellerID
ON tbl_SellerTransactions(sellerID)
WHERE sellerID BETWEEN '100' AND '500'
3.
CREATE NONCLUSTERED INDEX FI_Employee_DOJ
ON tbl_SellerTransactions(ST_Name)
INCLUDE(SellerId,amt,ProductId,BankNumber) --Including remaining columns in the index
WHERE ST_Name IS NOT NULL
Notice
A filtered index cannot be created on a view, although the optimizer can still use a filtered index defined on a table that the view references.
Filtered indexes do not apply to full-text indexes.
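For comparison with the filtered-index examples above, here is a sketch of the composite index the question itself describes. The column names come from the question. The table name dbo.SellerTransactions, the index name, the variable types and the sample values are assumptions, so adjust them to the real schema.

```sql
-- Key columns: the three columns every query filters on
-- (equality columns first, the date range last).
-- Included columns: the occasionally filtered columns, stored only at
-- the leaf level so they can be checked without a lookup.
CREATE NONCLUSTERED INDEX IX_SellerTransactions_Seller_Product_Date
ON dbo.SellerTransactions (SellerId, ProductId, CreateDate)
INCLUDE (Amount, BankNumber);

-- Hypothetical parameters; the types must match the real column types
-- to avoid implicit conversions.
DECLARE @SellerId  varchar(50) = 'S-001',
        @ProductId varchar(50) = 'P-042',
        @From      datetime    = '2019-01-01',
        @To        datetime    = '2019-02-01';

-- The common (90%) case: seek on the three key columns.
SELECT SellerId, ProductId, CreateDate, Amount, BankNumber
FROM dbo.SellerTransactions
WHERE SellerId = @SellerId
  AND ProductId = @ProductId
  AND CreateDate >= @From
  AND CreateDate <  @To;

-- The rarer (10%) case: same seek, with the Amount predicate evaluated
-- as a residual filter on the included column.
SELECT SellerId, ProductId, CreateDate, Amount, BankNumber
FROM dbo.SellerTransactions
WHERE SellerId = @SellerId
  AND ProductId = @ProductId
  AND CreateDate >= @From
  AND CreateDate <  @To
  AND Amount > 100;
```

Whether this beats separate indexes on Amount and BankNumber depends on the data, so it is worth checking the actual execution plans for both kinds of query.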

Why index setting is able to affect query cost when scan is imperative

I'm reviewing my performance tuning studies and practicing with AdventureWorks2012.
I made 4 copies of the Product table and set them up with the following indexes.
--tmpProduct1 nothing
CREATE CLUSTERED INDEX cIdx ON tmpProduct2 (ProductID ASC)
CREATE NONCLUSTERED INDEX ncIdx ON tmpProduct3 (ProductID ASC)
CREATE NONCLUSTERED INDEX ncIdx ON tmpProduct4 (ProductID ASC) INCLUDE (Name, ProductNumber)
Then I do the execution plan with following queries.
SELECT ProductID FROM tmpProduct1
SELECT ProductID FROM tmpProduct2
SELECT ProductID FROM tmpProduct3
SELECT ProductID FROM tmpProduct4
I expected the performance to be the same for all four of them, since they all need to scan. Besides, I select only the ProductID column and there is no WHERE condition.
However, it turns out they are not the same.
Why is the clustered index more expensive than the nonclustered index?
Why does the nonclustered index reduce the cost in this scenario?
Why do the included columns make query 4 cost more than query 3?

For query 1, with no indexes, you are scanning the entire table.
For query 2 you have a clustered index, but then again you are scanning the entire table. An index is useful only when it is used to eliminate rows, so this is the same as query 1.
Query 4 probably costs more than query 3 because of the indexes you have and the way indexes are stored. For now, it is enough to know that the upper levels of an index hold keys and the leaf level holds the data. For more info read this: https://www.sqlskills.com/blogs/kimberly/category/indexes/
For query 3, the index contains only the key, so fewer pages are needed to store the data and the scan has fewer pages to read.
For query 4, the index carries a few more columns, so it has more pages and the scan has more to read.
The screenshot below shows the page counts: tmpProduct4 (18) and tmpProduct3 (15). So the extra cost may be the I/O cost of reading the additional pages.
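The page counts mentioned above can also be read from the catalog rather than from a screenshot. A minimal sketch, assuming the four copies live in the dbo schema of the current database:

```sql
-- One row per heap or index on each test table, with its size in pages.
SELECT
    OBJECT_NAME(ps.object_id) AS table_name,
    i.name                    AS index_name,  -- NULL for a heap
    i.type_desc               AS index_type,  -- HEAP, CLUSTERED or NONCLUSTERED
    ps.used_page_count,
    ps.row_count
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id  = ps.index_id
WHERE ps.object_id IN (OBJECT_ID(N'dbo.tmpProduct1'),
                       OBJECT_ID(N'dbo.tmpProduct2'),
                       OBJECT_ID(N'dbo.tmpProduct3'),
                       OBJECT_ID(N'dbo.tmpProduct4'))
ORDER BY table_name, ps.index_id;
```

tmpProduct3 and tmpProduct4 each return two rows (the heap and its nonclustered index), which makes it easy to compare the size of the structure each query actually scans.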

Why does SQL Server use an Index Scan instead of a Seek + RID lookup?

I have a table with approx. 135M rows:
CREATE TABLE [LargeTable]
(
[ID] UNIQUEIDENTIFIER NOT NULL,
[ChildID] UNIQUEIDENTIFIER NOT NULL,
[ChildType] INT NOT NULL
)
It has a non-clustered index with no included columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX]
ON [LargeTable]
(
[ChildID] ASC
)
(It is clustered on ID).
I wish to join this against a temporary table which contains a few thousand rows:
CREATE TABLE #temp
(
ChildID UNIQUEIDENTIFIER PRIMARY KEY,
ChildType INT
)
...add #temp data...
SELECT lt.ChildID, lt.ChildType
FROM #temp t
INNER JOIN [LargeTable] lt
ON lt.[ChildID] = t.[ChildID]
However the query plan includes an index scan on the large table:
If I change the index to include extra columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX] ON [LargeTable]
(
[ChildID] ASC
)
INCLUDE ([ChildType])
Then the query plan changes to something more sensible:
So my question is: Why can't SQL Server still use an index seek in the first scenario, but with a RID lookup to get from the non-clustered index to the table data? Surely that would be more efficient than an index scan on such a large table?

The first query plan actually makes a lot of sense. Remember that SQL Server never reads individual records; it reads pages. In your table a page contains many records, since the records are so small.
With the original index, if the second query plan were used, then after reading index pages to find all the matching row locators, pages in the clustered index would have to be read to fetch the ChildType column. In the worst case, that is an entire page for each record it needs to read. As there are many records per page, that might boil down to reading a large percentage of the pages in the clustered index.
SQL Server estimated, based on statistics, that simply scanning the pages in the clustered index would require fewer page reads in total, because it then avoids reading the pages in the nonclustered index.
What matters here is the number of rows in the temp table compared to the number of pages in the large table. Assuming a random distribution of ChildID in the large table, as soon as the number of rows in the temp table approaches or exceeds the number of pages in the large table, SQL Server will have to read virtually every page in the large table anyway.
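The two numbers being compared here can be checked directly. A rough sketch, assuming LargeTable is in the dbo schema and that the query runs in the same session that created #temp:

```sql
SELECT
    -- Rows that would each need a lookup into the base table.
    (SELECT COUNT(*) FROM #temp) AS temp_rows,
    -- Leaf-level data pages of the base table.
    (SELECT SUM(ps.in_row_data_page_count)
     FROM sys.dm_db_partition_stats AS ps
     WHERE ps.object_id = OBJECT_ID(N'dbo.LargeTable')
       AND ps.index_id IN (0, 1)   -- 0 = heap, 1 = clustered index
    ) AS base_table_pages;
```

If temp_rows is tiny compared with base_table_pages, a seek with lookups touches only a small part of the table. As the two numbers approach each other, the scan becomes the cheaper choice.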

Because the column ChildType isn't covered in an index, it has to go back to the clustered index (with the mentioned Row IDentifier lookup) to get the values for ChildType.
When you INCLUDE this column in the nonclustered index it will be added to the leaf-level of the index where it is available for querying.

Colloquially this is called 'the index tipping point': the point at which the cost-based optimizer considers it more effective to do a scan rather than a seek + lookup. It is usually around 20% of the size, which in your case will be based on an estimate coming from the #temp table stats. YMMV.
You already have your answer: include the required column, make the index covering.

sql server multi column index queries

Suppose I have created a single index on two columns, [lastName] and [firstName], in that order. If I then run a query to find the number of people with the first name daniel:
SELECT count(*)
FROM people
WHERE firstName = N'daniel'
will this search each section of the index's first column (lastName) and use the second column (firstName) to quickly search through each of the blocks of lastName entries?
This seems like an obvious thing to do and I assume that it is what happens but you know what they say about assumptions.

Yes, this query may (and probably will) use this index, doing an Index Scan, if the query optimizer thinks that it is better to "quickly search through each of the blocks of LastName entries", as you say, than to do a full scan of the table.
An index on (firstName) would be more efficient for this particular query, though, so if there is such an index, SQL Server will use that one (and do an Index Seek).
Tested in SQL-Server 2008 R2, Express edition:
CREATE TABLE Test.dbo.people
( lastName NVARCHAR(30) NOT NULL
, firstName NVARCHAR(30) NOT NULL
) ;
INSERT INTO people
VALUES
('Johnes', 'Alex'),
... --- about 300 rows
('Johnes', 'Bill'),
('Brown', 'Bill') ;
Query without any index, Table Scan:
SELECT count(*)
FROM people
WHERE firstName = N'Bill' ;
Query with index on (lastName, firstName), Index Scan:
CREATE INDEX last_first_idx
ON people (lastName, firstName) ;
SELECT ...
Query with index on (firstName), Index Seek:
CREATE INDEX first_idx
ON people (firstName) ;
SELECT ...
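To compare the two access paths side by side once both indexes exist, index hints can pin each query to one index. This is a sketch for experimenting only. The hints are there just to make the comparison repeatable, and with about 300 rows the difference in reads will be small.

```sql
-- Report logical reads per statement in the Messages output.
SET STATISTICS IO ON;

-- Composite index: firstName is not the leading column, so the whole
-- index has to be scanned.
SELECT COUNT(*)
FROM people WITH (INDEX(last_first_idx))
WHERE firstName = N'Bill';

-- Single-column index: firstName is the leading column, so a seek is possible.
SELECT COUNT(*)
FROM people WITH (INDEX(first_idx))
WHERE firstName = N'Bill';

SET STATISTICS IO OFF;
```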

If you have an index on (lastname, firstname), in this order, then a query like
WHERE firstname = 'daniel'
won't use the index, as long as you don't include the first column of the composite index (i.e. lastname) in the WHERE clause. To efficiently search for firstname only, you will need a separate index on that column.
If you frequently search on both columns, create two separate single-column indexes. But keep in mind that each index has to be updated on every insert/update, which affects write performance.
Also, avoid composite indexes if they aren't covering indexes at the same time. For tips regarding composite indexes see the following article at sql-server-performance.com:
Tips on Optimizing SQL Server Composite Indexes
Update (to address downvoters):
In this specific case of SELECT Count(*) the index is a covering index (as pointed out by @ypercube in the comment), so the optimizer may choose it for execution. Using the index in this case means an Index Scan and not an Index Seek.
Doing an Index Scan means scanning every single row in the index. This will be faster if the index contains fewer rows than the whole table. So, if you have a highly selective index (with many unique values) you'll get an index with roughly as many rows as the table itself. In such a case there usually won't be a big difference between doing a Clustered Index Scan (which implies a PK on the table and iterates over the PK) and a Non-Clustered Index Scan (which iterates over the index). A Table Scan (as seen in the screenshot of @ypercube's answer) means that there is no PK on the table, which results in even slower execution than a Clustered Index Scan, as it doesn't have the advantage of the sequential data alignment on disk given by a PK.

Querying minimum value in SQL Server is a lot longer than querying all the rows

I'm currently confronted with strange behaviour in my database when I query the minimum ID for a specific date in a table that contains about a hundred million rows. The query is quite simple:
SELECT MIN(Id) FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
This query never ends; at least, I let it run for hours. The DateConnection column is neither indexed nor included in an index, so I would understand if this query took quite a while. But I tried the following query, which runs in a few seconds:
SELECT Id FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
It returns 300k rows.
My table is defined as this :
CREATE TABLE [dbo].[Connection](
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[DateConnection] [datetime] NOT NULL,
[TimeConnection] [time](7) NOT NULL,
[Hour] AS (datepart(hour,[TimeConnection])) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection] PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
)
And it has the following index :
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id] ON [dbo].[Connection]
(
[Id] ASC
)ON [PRIMARY]
One workaround I found for this strange behaviour is the following code, but it seems quite heavy for such a simple query.
create table #TempId
(
[Id] bigint
)
go
insert into #TempId
select id from partitionned_connection with(nolock) where dateconnection = '2012-06-26'
declare @displayId bigint
select @displayId = min(Id) from #TempId
print @displayId
go
drop table #TempId
go
Has anybody been confronted with this behaviour, and what is the cause of it? Is the minimum aggregate scanning the entire table? And if so, why does the simple select not do the same?

The root cause of the problem is the non-aligned nonclustered index, combined with the statistical limitation Martin Smith points out (see his answer to another question for details).
Your table is partitioned on [Hour] along these lines:
CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23);
CREATE PARTITION SCHEME PS
AS PARTITION PF ALL TO ([PRIMARY]);
-- Partitioned
CREATE TABLE dbo.Connection
(
Id bigint IDENTITY(1,1) NOT NULL,
DateConnection datetime NOT NULL,
TimeConnection time(7) NOT NULL,
[Hour] AS (DATEPART(HOUR, TimeConnection)) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection]
PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
ON PS ([Hour])
);
-- Not partitioned
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id]
ON dbo.Connection
(
Id ASC
)ON [PRIMARY];
-- Pretend there are lots of rows
UPDATE STATISTICS dbo.Connection WITH ROWCOUNT = 200000000, PAGECOUNT = 4000000;
The query and execution plan are:
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH (READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26';
The optimizer takes advantage of the index (ordered on Id) to transform the MIN aggregate to a TOP (1) - since the minimum value will by definition be the first value encountered in the ordered stream. (If the nonclustered index were also partitioned, the optimizer would not choose this strategy since the required ordering would be lost).
The slight complication is that we also need to apply the predicate in the WHERE clause, which requires a lookup to the base table to fetch the DateConnection value. The statistical limitation Martin mentions explains why the optimizer estimates it will only need to check 119 rows from the ordered index before finding one with a DateConnection value that will match the WHERE clause. The hidden correlation between DateConnection and Id values means this estimate is a very long way off.
In case you are interested, the Compute Scalar calculates which partition to perform the Key Lookup into. For each row from the nonclustered index, it computes an expression like [PtnId1000] = Scalar Operator(RangePartitionNew([dbo].[Connection].[Hour] as [c].[Hour],(1),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23))), and this is used as the leading key of the lookup seek. There is prefetching (read-ahead) on the nested loops join, but this needs to be an ordered prefetch to preserve the sorting required by the TOP (1) optimization.
Solution
We can avoid the statistical limitation (without using query hints) by finding the minimum Id for each Hour value, and then taking the minimum of the per-hour minimums:
-- Global minimum
SELECT
MinID = MIN(PerHour.MinId)
FROM
(
-- Local minimums (for each distinct hour value)
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH(READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26'
GROUP BY
c.[Hour]
) AS PerHour;
The execution plan is:
If parallelism is enabled, you will see a plan more like the following, which uses parallel index scan and multi-threaded stream aggregates to produce the result even faster:

Although it might be wise to fix the problem in a way that doesn't require index hints, a quick solution is this:
SELECT MIN(Id) FROM Connection WITH(NOLOCK, INDEX(PK_Connection)) WHERE DateConnection = '2012-06-26'
This forces a scan of the clustered index, i.e. of the whole table.
Alternatively, try this although it probably produces the same problem:
select top 1 Id
from Connection
WHERE DateConnection = '2012-06-26'
order by Id

It makes sense that finding the minimum takes longer than going through all the records. Finding the minimum of an unsorted structure takes much longer than traversing it once (unsorted because MIN() doesn't take advantage of the identity column). What you could do, since you're using an identity column, is have a nested select, where you take the first record from the set of records with the specified date.

The nonclustered index scan is the issue in your case. The plan scans the unique nonclustered index and then, for each row (potentially all hundred million of them), traverses the clustered index, which causes a huge amount of I/O. For example, if your index height is 4, that could be around 100 million * 4 I/Os on top of the scan of the nonclustered index leaf pages. The optimizer must have chosen this index to avoid the stream aggregate needed to get the minimum.
There are three main techniques for finding a minimum:
1. Use an index on the column whose minimum we want. This is efficient if such an index exists, because no calculation is required: the first qualifying row read is returned.
2. Use a hash aggregate, which usually happens only when you have a GROUP BY.
3. Use a stream aggregate, which scans through all the qualifying rows, keeps track of the smallest value seen so far, and returns it once all rows have been scanned.
The query without MIN, however, used a clustered index scan and is fast because it has to read fewer pages and therefore does less I/O.
Now the question is why the optimizer picked the scan of the nonclustered index. I am sure it is to avoid the computation involved in a stream aggregate, but in this case not using the stream aggregate is much more costly. The choice depends on estimates, so I guess the statistics on the table are not up to date.
So first of all, check whether your statistics are up to date. When were they last updated?
To avoid the issue, do the following:
1. First update the table statistics, which should remove your issue.
2. If you cannot update the statistics, or updating them doesn't change the plan and it still uses the nonclustered index scan, then you can force the clustered index scan, which uses less I/O, followed by a stream aggregate to get the minimum value.
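The two steps above could look roughly like this. It is a sketch against the dbo.Connection table from the question. Updating statistics on a table of this size can take a while, and the index hint is a last resort, not a general fix.

```sql
-- When were the statistics on the table last updated?
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Connection');

-- Step 1: refresh the statistics, then re-check the plan.
UPDATE STATISTICS dbo.Connection;

-- Step 2: if the plan still scans IX_Connection_Id with lookups,
-- force a scan of the clustered index (PK_Connection) instead.
SELECT MIN(c.Id)
FROM dbo.Connection AS c WITH (INDEX(PK_Connection))
WHERE c.DateConnection = '2012-06-26';
```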
