sql server multi column index queries - sql-server

If I have created a single index on two columns [lastName] and [firstName] in that order. If I then do a query to find the number of the people with first name daniel:
SELECT count(*)
FROM people
WHERE firstName = N'daniel'
will this search in each section of the first index (lastname) and use the secondary index (firstName) to quickly search through each of the blocks of LastName entries?
This seems like an obvious thing to do and I assume that it is what happens but you know what they say about assumptions.

Yes, this query may - and probably do - use this index (and do an Index Scan) if the query optimizer thinks that it's better to "quickly search through each of the blocks of LastName entries" as you say than (do an Full Scan) of the table.
An index on (firstName) would be more efficient though for this particular query so if there is such an index, SQL-Server will use that one (and do an Index Seek).
Tested in SQL-Server 2008 R2, Express edition:
CREATE TABLE Test.dbo.people
( lastName NVARCHAR(30) NOT NULL
, firstName NVARCHAR(30) NOT NULL
) ;
INSERT INTO people
VALUES
('Johnes', 'Alex'),
... --- about 300 rows
('Johnes', 'Bill'),
('Brown', 'Bill') ;
Query without any index, Table Scan:
SELECT count(*)
FROM people
WHERE firstName = N'Bill' ;
Query with index on (lastName, firstName), Index Scan:
CREATE INDEX last_first_idx
ON people (lastName, firstName) ;
SELECT ...
Query with index on (firstName), Index Seek:
CREATE INDEX first_idx
ON people (firstName) ;
SELECT ...

If you have an index on (lastname, firstname), in this order, then a query like
WHERE firstname = 'daniel'
won't use the index, as long as you don't include the first column of the composite index (i.e. lastname) in the WHERE clause. To efficiently search for firstname only, you will need a separate index on that column.
If you frequently search on both columns, do 2 separate single column indexes. But keep in mind that each index will be updated on insert/update, so affecting performance.
Also, avoid composite indexes if they aren't covering indexes at the same time. For tips regarding composite indexes see the following article at sql-server-performance.com:
Tips on Optimizing SQL Server Composite Indexes
Update (to address downvoters):
In this specific case of SELECT Count(*) the index is a covering index (as pointed out by #ypercube in the comment), so the optimizer may choose it for execution. Using the index in this case means an Index Scan and not an Index Seek.
Doing an Index Scan means scanning every single row in the index. This will be faster, if the index contains less rows than the whole table. So, if you got a highly selective index (with many unique values) you'll get an index with roughly as many rows as the table itself. In such a case usually there won't be a big difference in doing a Clustered Index Scan (implies a PK on the table, iterates over the PK) or a Non-Clustered Index Scan (iterates over the index). A Table Scan (as seen in the screenshot of #ypercube's answer) means that there is no PK on the table, which results in an even slower execution than a Clustered Index Scan, as it doesn't have the advantage of sequential data alignment on disk given by a PK.

Related

How to I force a better execution plan when the database is forcing a join?

I am optimizing a query on SQL Server 2005. I have a simple query against mytable that has about 2 million rows:
SELECT id, num
FROM mytable
WHERE t_id = 587
The id field is the Primary Key (clustered index) and there exists a non-clustered index on the t_id field.
The query plan for the above query is including both a Clustered Index Seek and an Index Seek, then it's executing a Nested Loop (Inner Join) to combine the results. The STATISTICS IO is showing 3325 page reads.
If I change the query to just the following, the server is only executing 6 Page Reads and only a single Index Seek with no join:
SELECT id
FROM mytable
WHERE t_id = 587
I have tried adding an index on the num column, and an index on both num and tid. Neither index was selected by the server.
I'm looking to reduce the number of page reads but still retrieve the id and num columns.
The following index should be optimal:
CREATE INDEX idx ON MyTable (t_id)
INCLUDE (num)
I cannot remember if INCLUDEd columns were valid syntax in 2005, you may have to use:
CREATE INDEX idx ON MyTable (t_id, num)
The [id] column will be included in the index as it is the clustered key.
The optimal index would be on (t_id, num, id).
The reason your query is probably one that bad side is because multiple rows are being selected. I wonder if rephrasing the query like this would improve performance:
SELECT t.id, t.num
FROM mytable t
WHERE EXISTS (SELECT 1
FROM my_table t2
WHERE t2.t_id = 587 AND t2.id = t.id
);
Lets clarify the problem and then discuss on the solutions to improve it:
You have a table(lets call it tblTest1 and contains 2M records) with a Clustered Index on id and a Non Clustered Index on t_id, and you are going to query the data which filters the data using Non Clustered Index and getting the id and num columns.
So SQL server will use the Non Clustered Index to filter the data(t_id=587), but after filtering the data SQL server needs to get the values stored in id and num columns. Apparently because you have Clustered index then SQL server will use this index to obtain the data stored in id and num columns. This happens because leafs in the Non clustered index's tree contains the Clustered index's value, this is why you see the Key Lookup operator in the execution plan. In fact SQL Server uses the Index seek(NonCluster) to find the t_id=587 and then uses the Key Lookup to get the num data!(SQL Server will not use this operator to get the value stored in id column, because your have a Clustered index and the leafs in NonClustered Index contains the Clustered Index's value).
Referred to the above-mentioned screenshot, when we have Index Seek(NonClustred) and a Key Lookup, SQL Server needs a Nested Loop Join operator to get the data in num column using the Index Seek(Nonclustered) operator. In fact in this stage SQL Server has two separate sets: one is the results obtained from Nonclustered Index tree and the other is data inside Clustered Index tree.
Based on this story, the problem is clear! What will happen if we say to SQL server, not to do a Key Lookup? this will cause the SQL Server to execute the query using a shorter way(No need to Key Lookup and apparently no need to the Nested loop join! ).
To achieve this, we need to INCLUDE the num column inside the NonClustered index's tree, so in this case the leaf of this index will contains the id column's data and also the num column's data! Clearly when we say the SQL Server to find a data using NonClustred Index and return the id and num columns, it will not need to do a Key Lookup!
Finally what we need to do, is to INCLUDE the num in NonClustered Index! Thanks to #MJH's answer:
CREATE NONCLUSTERED INDEX idx ON tblTest1 (t_id)
INCLUDE (num)
Luckily, SQL Server 2005 provided a new feature for NonClustered indexes, the ability to include additional, non-key columns in the leaf level of the NonClustered indexes!
Read more:
https://www.red-gate.com/simple-talk/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/create-indexes-with-included-columns?view=sql-server-2017
But what will happens if we write the query like this?
SELECT id, num
FROM tblTest1 AS t1
WHERE
EXISTS (SELECT 1
FROM tblTest1 t2
WHERE t2.t_id = 587 AND t2.id = t1.id
)
This is a great approach, but lets see the execution plan:
Clearly, SQL server needs to do a Index seek(NonClustered) to find the t_id=587 and then obtain the data from Clustered Index using Clustered Index Seek. In this case we will not get any notable performance improvement.
Note: When you are using Indexes, you need to have an appropriate plan to maintain them. As the indexes get fragmented, their impact on the query performance will be decreased and you might face performance problems after a while! You need to have an appropriate plan to Reorganize and Rebuild them, when they get fragmented!
Read more: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-2017

Why index setting is able to affect query cost when scan is imperative

I'm having a review of performance tuning study and practicing with AdventureWorks2012.
I built 4 copies from Product table then setup with the following indexes.
--tmpProduct1 nothing
CREATE CLUSTERED INDEX cIdx ON tmpProduct2 (ProductID ASC)
CREATE NONCLUSTERED INDEX ncIdx ON tmpProduct3 (ProductID ASC)
CREATE NONCLUSTERED INDEX ncIdx ON tmpProduct4 (ProductID ASC) INCLUDE (Name, ProductNumber)
Then I do the execution plan with following queries.
SELECT ProductID FROM tmpProduct1
SELECT ProductID FROM tmpProduct2
SELECT ProductID FROM tmpProduct3
SELECT ProductID FROM tmpProduct4
I expected the performance should be the same to all four of them since they all need to scan. Plus, I select only ProductID column and there is no WHERE condition.
However, it turns out to be
Why is clustered index more expensive than non-clustered index?
Why non-clustered index reduce the cost in this scenario?
Why columns store makes query4 cost more than query3?
For query1 without indexes, you are scanning entire table..
For query2 ,you have a clustered index,but then again..you are scanning the entire table..any index is usefull only when you use to eliminate rows..so this is same as query1
Reason for query4 cost more than query 3 may be due to the index you have and the way indexes are stored..For know ,it is enough to know keys are stored at root level and data is stored at leaf level...For more info read this :https://www.sqlskills.com/blogs/kimberly/category/indexes/
For query3,there is only key,so the number of pages required to store the data will be less and thus requires less traversal
For query 4, you have few more columns,thus more pages and more traversal
Below screenshot shows you the pages tmproduct4(18),tmproduct3(15)..so the extra cost may be IO cost required to traverse additional pages

Log table with non unique columns; what indexes to create

I have a log table with two columns.
DocumentType (varchar(250), not unique, not null)
DateEntered (Date, not unique, not null)
The table will only have rows inserted, never updated or deleted.
Here is the stored procedure for the report:
SELECT DocumentType,
COUNT(DocumentType) AS "CountOfDocs"
FROM DocumentTypes
WHERE DateEntered>= #StartDate AND DateEntered<= #EndDate
GROUP BY DocumentType
ORDER BY DocumentType ASC;
In the future user may want to also filter by document type in a different report. I currently have a non-clustered index containing both columns. Is this the proper index to create?
Clustered index on the date, for sure.
I think your NCI is fine. I would say both in as named columns as I assume you will have the date in the WHERE clause for your queries. I don't think 1000 per day worst case scenario will have a major impact on insert times when loading the data.
Don't add any index. It'll be heap table and wait for your "future you" with task to select something from this table :).
If you want index:
With heap: Add index on field you will filter and if the second one is only in select (=isn't in where clause) put the second one as included column. If you'll filter with both column put index on both columns.
If you want add clustered index (for example on new autoincrement primary key column) add only one index on col you want filter or try to don't add aditional index and check execution plan and efectivity - in most cases is clustered index with seeks enough.
Don't create clustered index on nonunique columns (it's used only in very special cases).

SQL Server 2014 Index Optimization: Any benefit with including primary key in indexes?

After running a query, the SQL Server 2014 Actual Query Plan shows a missing index like below:
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1) INCLUDE
(PK_Column,SomeOtherColumn)
The missing index suggests to include the Primary Key column in the index. The table is clustered index with the PK_Column.
I am confused and it seems that I don’t get the concept of Clustered Index Primary Key right.
My assumption was: when a table has a clustered PK, all of the non-clustered indexes point to the PK value. Am I correct? If I am, why the query plan missing index asks me to include the PK column in the index?
Summary:
Index advised is not valid,but it doesn't make any difference.See below tests section for details..
After researching for some time,found an answer here and below statement explains convincingly about missing index feature..
they only look at a single query, or a single operation within a single query. They don't take into account what already exists or your other query patterns.
You still need a thinking human being to analyze the overall indexing strategy and make sure that you index structure is efficient and cohesive.
So coming to your question,this index advised may be valid ,but should not to be taken for granted. The index advised is useful for SQL Server for the particular query executed, to reduce cost.
This is the index that was advised..
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1)
INCLUDE (PK_Column, SomeOtherColumn)
Assume you have a query like below..
select pk_column, someothercolumn
from table
where column1 = 'somevalue'
SQL Server tries to scan a narrow index as well if available, so in this case an index as advised will be helpful..
Further you didn't share the schema of table, if you have an index like below
create index nci_test on table(column1)
and a query of below form will advise again same index as stated in question
select pk_column, someothercolumn
from table
where column1 = 'somevalue'
Update :
i have orders table with below schema..
[orderid] [int] NOT NULL Primary key,
[custid] [char](11) NOT NULL,
[empid] [int] NOT NULL,
[shipperid] [varchar](5) NOT NULL,
[orderdate] [date] NOT NULL,
[filler] [char](160) NOT NULL
Now i created one more index of below structure..
create index onlyempid on orderstest(empid)
Now when i have a query of below form
select empid,orderid,orderdate --6.3 units
from orderstest
where empid=5
index advisor will advise below missing index .
CREATE NONCLUSTERED INDEX empidalongwithorderiddate
ON [dbo].[orderstest] ([empid])
INCLUDE ([orderid],[orderdate])--you can drop orderid too ,it doesnt make any difference
If you can see orderid is also included in above suggestion
now lets create it and observe both structures..
---Root level-------
For index onlyempid..
for index empidalongwithorderiddate
----leaf level-------
For index onlyempid..
for index empidalongwithorderiddate
As you can see , creating as per suggestion makes no difference,Even though it is invalid.
I Assume suggestion was made by Index advisor based on query ran and is specifically for the query and it has no idea of other indexes involved
I don't know your schema, nor your queries. Just guessing.
Please correct me if this theory is incorrect.
You are right that non-clustered indexes point to the PK value. Imagine you have large database (for example gigabytes of files) stored on ordinary platter hard-drive. Lets suppose that the disk is fragmented and the PK_index is saved physical far from your Table1 Index.
Imagine that your query need to evaluate Column1 and PK_column as well. The query execution read Column1 value, then PK_value, then Column1 value, then PK_value...
The hard-drive platter is spinning from one physical place to another, this can take time.
Having all you need in one index is more effective, because it means reading one file sequentially.

Why does SQL Server use an Index Scan instead of a Seek + RID lookup?

I have a table with approx. 135M rows:
CREATE TABLE [LargeTable]
(
[ID] UNIQUEIDENTIFIER NOT NULL,
[ChildID] UNIQUEIDENTIFIER NOT NULL,
[ChildType] INT NOT NULL
)
It has a non-clustered index with no included columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX]
ON [LargeTable]
(
[ChildID] ASC
)
(It is clustered on ID).
I wish to join this against a temporary table which contains a few thousand rows:
CREATE TABLE #temp
(
ChildID UNIQUEIDENTIFIER PRIMARY KEY,
ChildType INT
)
...add #temp data...
SELECT lt.ChildID, lt.ChildType
FROM #temp t
INNER JOIN [LargeTable] lt
ON lt.[ChildID] = t.[ChildID]
However the query plan includes an index scan on the large table:
If I change the index to include extra columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX] ON [LargeTable]
(
[ChildID] ASC
)
INCLUDE [ChildType]
Then the query plan changes to something more sensible:
So my question is: Why can't SQL Server still use an index seek in the first scenario, but with a RID lookup to get from the non-clustered index to the table data? Surely that would be more efficient than an index scan on such a large table?
The first query plan actually makes a lot of sense. Remember that SQL Server never reads records, it reads pages. In your table, a page contains many records, since those records are so small.
With the original index, if the second query plan would be used, after finding all the RID's in the index, and reading index pages to do so, pages in the clustered index need to be read to read the ChildType column. In a worst case scenario, that is an entire page for each record it needs to read. As there are many records per page, that might boil down to reading a large percentage of the pages in the clustered index.
SQL server guessed, based on statistics, that simply scanning the pages in the clustered index would require less page reads in total, because it then avoids reading the pages in the non-clustered index.
What matters here is the number of rows in the temp table compared to the number of pages in the large table. Assuming a random distribution of ChildID in the large table, as soon as the number of rows in the temp table approaches or supersedes the number of pages in the large table, SQL server will have to read virtually every page in the large table anyway.
Because the column ChildType isn't covered in an index, it has to go back to the clustered index (with the mentioned Row IDentifier lookup) to get the values for ChildType.
When you INCLUDE this column in the nonclustered index it will be added to the leaf-level of the index where it is available for querying.
Colloquially is called 'the index tipping point'. Basically, at what point does the cost based optimizer consider that is more effective to do a scan rather than seek + lookup. Usually is around 20% of the size, which in your case will base on an estimate coming from the #temp table stats. YMMV.
You already have your answer: include the required column, make the index covering.

Resources