sql server execution plan - nested loop join - sql-server

I have a very simple select on SQL Server:
select * from person
where first_name = 'John' and last_name = 'Smith'`
In the execution plan I have:
Nonclustered Index seek - NC_First_Last_pers
Key lookup (Clustered) on PK
And this two goes into a Nested Loop Join.
My question is:
Why is there the join? I thought that this is used only for joining different tables, but I have only 1 table here.
Thanks!

In the index you have the data for the columns that are included in the index plus the clustered key. You are querying the table with * meaning that you have to go look for all column values and those are stored together with the clustered key.
The query uses the index on the names to find all rows that match and then uses the clustered key to find the data you want.

Related

How to I force a better execution plan when the database is forcing a join?

I am optimizing a query on SQL Server 2005. I have a simple query against mytable that has about 2 million rows:
SELECT id, num
FROM mytable
WHERE t_id = 587
The id field is the Primary Key (clustered index) and there exists a non-clustered index on the t_id field.
The query plan for the above query is including both a Clustered Index Seek and an Index Seek, then it's executing a Nested Loop (Inner Join) to combine the results. The STATISTICS IO is showing 3325 page reads.
If I change the query to just the following, the server is only executing 6 Page Reads and only a single Index Seek with no join:
SELECT id
FROM mytable
WHERE t_id = 587
I have tried adding an index on the num column, and an index on both num and tid. Neither index was selected by the server.
I'm looking to reduce the number of page reads but still retrieve the id and num columns.
The following index should be optimal:
CREATE INDEX idx ON MyTable (t_id)
INCLUDE (num)
I cannot remember if INCLUDEd columns were valid syntax in 2005, you may have to use:
CREATE INDEX idx ON MyTable (t_id, num)
The [id] column will be included in the index as it is the clustered key.
The optimal index would be on (t_id, num, id).
The reason your query is probably one that bad side is because multiple rows are being selected. I wonder if rephrasing the query like this would improve performance:
SELECT t.id, t.num
FROM mytable t
WHERE EXISTS (SELECT 1
FROM my_table t2
WHERE t2.t_id = 587 AND t2.id = t.id
);
Lets clarify the problem and then discuss on the solutions to improve it:
You have a table(lets call it tblTest1 and contains 2M records) with a Clustered Index on id and a Non Clustered Index on t_id, and you are going to query the data which filters the data using Non Clustered Index and getting the id and num columns.
So SQL server will use the Non Clustered Index to filter the data(t_id=587), but after filtering the data SQL server needs to get the values stored in id and num columns. Apparently because you have Clustered index then SQL server will use this index to obtain the data stored in id and num columns. This happens because leafs in the Non clustered index's tree contains the Clustered index's value, this is why you see the Key Lookup operator in the execution plan. In fact SQL Server uses the Index seek(NonCluster) to find the t_id=587 and then uses the Key Lookup to get the num data!(SQL Server will not use this operator to get the value stored in id column, because your have a Clustered index and the leafs in NonClustered Index contains the Clustered Index's value).
Referred to the above-mentioned screenshot, when we have Index Seek(NonClustred) and a Key Lookup, SQL Server needs a Nested Loop Join operator to get the data in num column using the Index Seek(Nonclustered) operator. In fact in this stage SQL Server has two separate sets: one is the results obtained from Nonclustered Index tree and the other is data inside Clustered Index tree.
Based on this story, the problem is clear! What will happen if we say to SQL server, not to do a Key Lookup? this will cause the SQL Server to execute the query using a shorter way(No need to Key Lookup and apparently no need to the Nested loop join! ).
To achieve this, we need to INCLUDE the num column inside the NonClustered index's tree, so in this case the leaf of this index will contains the id column's data and also the num column's data! Clearly when we say the SQL Server to find a data using NonClustred Index and return the id and num columns, it will not need to do a Key Lookup!
Finally what we need to do, is to INCLUDE the num in NonClustered Index! Thanks to #MJH's answer:
CREATE NONCLUSTERED INDEX idx ON tblTest1 (t_id)
INCLUDE (num)
Luckily, SQL Server 2005 provided a new feature for NonClustered indexes, the ability to include additional, non-key columns in the leaf level of the NonClustered indexes!
Read more:
https://www.red-gate.com/simple-talk/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/create-indexes-with-included-columns?view=sql-server-2017
But what will happens if we write the query like this?
SELECT id, num
FROM tblTest1 AS t1
WHERE
EXISTS (SELECT 1
FROM tblTest1 t2
WHERE t2.t_id = 587 AND t2.id = t1.id
)
This is a great approach, but lets see the execution plan:
Clearly, SQL server needs to do a Index seek(NonClustered) to find the t_id=587 and then obtain the data from Clustered Index using Clustered Index Seek. In this case we will not get any notable performance improvement.
Note: When you are using Indexes, you need to have an appropriate plan to maintain them. As the indexes get fragmented, their impact on the query performance will be decreased and you might face performance problems after a while! You need to have an appropriate plan to Reorganize and Rebuild them, when they get fragmented!
Read more: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-2017

SQL Server : join using index containing field in select, instead of join on PK

I'm looking for some information on how Microsoft SQL Server (specifically 2016 in my case) decides on which index to use for joins. Most of the time it seems to work well, but other times I have to scratch my head...
Here is a simple query:
SELECT
OrderMaster.StartDate, Employee.EmpName
FROM
OrderMaster
INNER JOIN
Employee ON OrderMaster.EmpId = Employee.EmpId
EmpId is the PK (int) on the Employee table.
There are 2 indexes on Employee: PK_Employee, which is a clustered PK index on EmpId, and IX_Employee_EmpName which is a nonclustered index, and only uses column EmpName.
This query returns 500 order records, but has performed an Index Scan using IX_Employee_EmpName on the Employee table, which reads 30,000 records.
Why would it have picked this index, instead of doing an index seek on PK_Employee?
What is the preferred method to solve this problem? Would I specify that the join use the PK_Employee index, or should I create a new index on EmpId, but includes the additional columns I may pull from selecting (empname, address, etc)?
This query returns 500 order records, but has performed an Index Scan using IX_Employee_EmpName on the Employee table, which read 30,000 records.
SQL Server is a cost based optimizer, so it tries to choose a plan based on cost and in this case it chose this index IX_Employee_EmpName. Since empid is the clustered key, it is by default added to all nonclustered indexes and SQL Server has the two columns it needs from this narrow index

SQL Server execution plan is suggesting to create an index containing all the columns in the table

I've got a key table with 2 columns: Key, Id.
In a stored procedure I've written, my code joins the Employee table to the Key column, then selects the Id - something like this:
SELECT
E.EmployeeName, K.Id
FROM
Employee E
JOIN
KeyTable K ON E.Key = K.Key
The execution plan is suggesting to create the following index:
[schema].[Employee] ([Key]) INCLUDE ([Id])
My question is why? If all the information is in the table to begin with why create an index and duplicate that information?
Just because all of the information is "in the table", that doesn't mean that searching the entire table is going to be the most efficient way of obtaining the results for this query.
Here, the server is saying that, if it had a way to quickly locate rows in this table, given a Key value, that the query should be able to be processed more quickly (not that it's 100% reliable in its suggestions, so you should test before implementing).
This can be true if the table is a heap (no clustered index) or for a clustered table where the clustering key(s) don't match the desired access order for the query.
Also, if you think about it - every (non-clustered) index duplicates information. It's just that usually its a subset of the information rather than the whole set.

How to update guid ID references when converting to identity IDs

I am trying to convert tables from using guid primary keys / clustered indexes to using int identities. This is for SQL Server 2005. There are two tables MainTable and RelatedTable, and the current table structure is as follows:
MainTable [40 million rows]
IDGuid - uniqueidentifier - PK
-- [data columns]
RelatedTable [400 million rows]
RelatedTableID - uniqueidentifier - PK
MainTableIDGuid - uniqueidentifier [foreign key to MainTable]
SequenceNumber - int - incrementing number per main table entry since there can be multiple entries related to a given row in the main table. These go from 1,2,3... etc for each MainTableIDGuid value.
-- [data columns]
The clustered index for MainTable is currently the primary key (IDGuid). The clustered index for RelatedTable is currently (MainTableIDGuid, SequenceNumber).
I want my conversion is do several things:<
Change MainTable to use an integer ID instead of GUID
Add a MainTableIDInt column to related table that links to Main Table's integer ID
Change the primary key and clustered index of RelatedTable to (MainTableIDInt, SequenceNumber)
Get rid of the guid columns.
I've written a script to do the following:
Add an IDInt int IDENTITY column to MainTable. This does a table rebuild and generates the new identity ID values.
Add a MainTableIDInt int column to RelatedTable.
The next step is to populate the RelatedTable.MainTableIDInt column for each row with its corresponding MainTable.IDInt value [based on the matching guid IDs]. This is the step I'm hung up on. I understand this is not going to be speedy, but I'd like to have it perform as well as possible.
I can write a SQL statement that does this update:
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = (SELECT MainTable.IDInt FROM MainTable WHERE MainTable.IDGuid = RelatedTable.MainTableIDGuid)
or
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = MainTable.IDInt
FROM RelatedTable
LEFT OUTER JOIN MainTable ON RelatedTable.MainTableIDGuid = MainTable.IDGuid
The 'Display Estimated Execution Plan' displays roughly the same for both of these queries. The execution plan it spits out does the following:
Clustered index scans over MainTable and RelatedTable and does a Merge Join on them [estimated number of rows = 400 million]
Sorts [estimated number of rows = 400 million]
Clustered index update over RelatedTable [estimated number of rows = 400 million]
I'm concerned about the performance of this [sorting 400 million rows sounds unpleasant]. Are my concerns about performance of these execution plan justified? Is there a better way to update the new ID for my related table that will scale given the size of the tables?
First, this will be a headache. Second, I wouldn't change any of the indexes or constraints until I had the data in place. I.e., I would add the identity column but not make it the primary key nor clustered index. Then I'd add the soon-to-be new foreign keys to the various tables. Your queries should look like:
Update ChildTable
Set NewIntForeignKeyId = P.NewIntPrimaryKey
From ChildTable As C
Join ParentTable As P
On P.PrimaryKey = C.ForeignKey
First, notice that I'm using an inner join. There is no reason to use an outer join for this type of query given that you will eventually enforce referential integrity between the new columns. Second, if you populate the columns first and then rebuild the constraints, it will be faster as you'll be able to leverage the existing indexes. Remember that when you change the clustered index, it rebuilds all of the nonclustered indexes. If the tables are large, that will be a serious hit.
Once you have the data in place, I'd then drop all primary constraints, unique constraints, foreign key constraints and unique indexes. Drop the clustered index/constraint last. I'd then add the clustered indexes to all of the tables and after that was done, recreate the unique constraints, foreign key constraints and indexes. If you do not drop the existing indexes before you recreate the clustered index, it will rebuild the existing indexes twice: once when you drop the clustered index and again when you recreate it.
Btw, I highly doubt there is a way to avoid table scans for this sort of thing since you are going to be updating every row.

SQL Server Indexes

What's the Need for going for Non-clustered index even though table has clustered index?
For optimal performance you have to create an index for every combination used in your queries. For instance if you have a select like this.
SELECT *
FROM MyTable
WHERE Col_1 = #SomeValue AND
Col_2 = #SomeOtherValue
Then you should do a clustered index with Col_1 and Col_2.
On the other hand if you have an additional query which only looks up one of the Column like:
SELECT *
FROM MyTable
WHERE Col_1 = #SomeValue
Then you should have an index with just the Col_1.
So you end up with two indexes. One with Col_1 and Col_2 and another with just Col_1.
The "need" is to do faster lookups of columns not included in the clustered index.
Don't get clustered indexes confused with indexes across multiple columns. That isn't the same thing.
Here's an article that does a good job of explaining clustered vs. non-clustered indexes.
In mssql server you can only have one clustered index per table, and it's almost always the primary key. A clustered index is "attached" to the table so it doesn't need to go back to the table to get any other data elements that might be in the "select" clause. A non-clustered index is not attached, but contains a reference back to the table row with all the rest of the data.

Resources