We have 10M records in a table in a SQL Server 2012 database and we want to retrieve the top 2000 records based on a condition.
Here's the SQL statement:
SELECT TOP 2000 *
FROM Users
WHERE LastName = 'Stokes'
ORDER BY LastName
I have added a non clustered index to the column LastName and it takes 9secs to retrieve 2000 records. I tried creating an indexed view with an index, created on the same column, but to no avail, it takes about the same time. Is there anything else I can do, to improve on the performance?
Using select * will cause key lookups for all the rows that match your criteria (=for each value of the clustered key, the database has to travel through the clustered index into the leaf level to find the rest of the values).
You can see that in the actual plan, and you can also check that the index you created is actually being used (=index seek for that index). If keylookup is what is the reason for the slowness, the select will go fast if you run just select LastName from ....
If there is actually just few columns you need from the table (or there's not that many columns in the table) you can add those columns in as included columns in your index and that should speed it up. Always specify what fields you need in the select instead of just using select *.
Related
An application (the underlying code of which can't be modified) is running a select statement against a relatively large table (~1.8M rows, 2GB) and creating a huge performance bottleneck on the DB server.
The table itself has approx. 120 columns of varying datatype. The select statement is selecting about 100 of those columns based on values of 2 columns which have both been indexed individually and together.
e.g.
SELECT
column1,
column2,
column3,
column4,
column5,
and so on.....
FROM
ITINDETAIL
WHERE
(column23 = 1 AND column96 = 1463522)
However, SQL Server chooses to ignore the index and instead scans the clustered (PK) index which takes forever (this was cancelled after 42 seconds, it's been known to take over 8 minutes on a busy production DB).
If I simply change the select statement, replacing all columns with just a select count(*) , the index is used and results return in milliseconds.
EDIT: I believe ITINDETAIL_004 (where column23 and column96 have been indexed together) is the index that should be getting used for the original statement.
Confusingly, if I create a non-clustered index on the table for the two columns in the where clause like this:
CREATE NONCLUSTERED INDEX [ITINDETAIL20220504]
ON [ITINDETAIL] ([column23] ASC, [column96] ASC)
INCLUDE (column1, column2, column3, column4, column5,
and so on..... )
and include ALL other columns from the select statement in the index (in my mind, this is essentially just creating a TABLE!), then run the original select statement, the new HUGE index is used and results are returned very quickly:
However, I'm not sure this is the right way to address the problem.
How can I force SQL Server to use what I think is the correct index?
Version is: SQL Server 2017
Based on additional details from the comments, it appears that the index you want SQL Server to use isn't a covering index; this means that the index doesn't contain all the columns that are referenced in the query. As such, if SQL Server were to use said index, then it would need to first do a seek on the index, and then perform a key lookup on the clustered index to get the full details of the row. Such lookups can be expensive.
As a result of the index you want not being covering, SQL Server has determined that the index you want it to to use would produce an inferior query plan to simply scanning the entire clustered index; which is by definition covering as it INCLUDEs all other columns not in the CLUSTERED INDEX.
For your index ITINDETAIL20220504 you have INCLUDEd all the columns that are in your SELECT, which means that it is covering. This means that SQL Server can perform a seek on the index, and get all the information it needs from that seek; which is far less costly that a seek followed by a key lookup and quicker than a scan of the entire clustered index. This is why this information works.
We coould put this into some kind of analogy using a Library type scenario, which is full of Books, to help explain this idea more:
Let's say that the Clustered Index is a list of every book in the library sorted by it's ISBN number (The Primary Key). Along side that ISBN number you have the details of the Author, Title, Publication Date, Publisher, If it's hardcover or softcover, the colour of the spine, the section of the Library the Book is located in, the book case, and the shelf.
Now let's say you want to obtain any books by the the Author Brandon Sanderson published on or after 2015-01-01. If you then wanted to you could go through the entire list, one by one, finding the books by that author, checking the publication date, and then writing down it's location so you can go and visit each of those locations and collect the book. This is effectively a Clustered Index Scan.
Now let's say you have a list of all the books in the Library again. The list contains the Author, Publication Date, and the ISBN (The Primary Key), and is ordered by the Author and the Publication Date. You want to fulfil the same task; obtain any books by the the Author Brandon Sanderson published on or after 2015-01-01. Now you can easily go through that list and find all those books, but you don't know where they are. As a result even after you have gone straight to the Brandon Sanderson "section" of the list, you'll still need to write all the ISBNs down, and then find each of those ISBN in the original list, get their location and title. This is your index ITINDETAIL_004; you can easily find the rows you want to filter to, but you don't have all the information so you have to go somewhere else afterwards.
Lastly we have a 3rd list, this list is ordered by the author and then publication date (like the 2nd list), but also includes the Title, the section of the Library the Book is located in, the book case, and the shelf, as well as the ISBN (Primary key). This list is ideal for your task; it's in the right order, as you can easily go to Brandon Sanderson and then the first book published on or after 2015-01-01, and it has the title and location of the book. This is your INDEX ITINDETAIL20220504 would be; it has the information in the order you want, and contains all the information you asked for.
Saying all this, you can force SQL Server to choose the index, but as I said in my comment:
In truth, rarely is the index you think is the correct is the correct index if SQL Server thinks otherwise. Despite what some believe, SQL Server is very good at making informed choices about what index(es) it should be using, provided your statistics are up to date, and the cached plan isn't based on old and out of date information.
Let's, however, demonstrate what happens if you do with a simple set up:
CREATE TABLE dbo.SomeTable (ID int IDENTITY(1,1) PRIMARY KEY,
SomeGuid uniqueidentifier DEFAULT NEWID(),
SomeDate date)
GO
INSERT INTO dbo.SomeTable (SomeDate)
SELECT DATEADD(DAY,T.I,'19000101')
FROM dbo.Tally(100000,0) T;
Now we have a table with 100001 rows, with a Primary Key on ID. Now let's do a query which is an overly simplified version of yours:
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';
No surprise, this results in an Clsutered Index Scan:
Ok, let's add an index on SomeDate and run the query again:
CREATE INDEX IX_NonCoveringIndex ON dbo.SomeTable (SomeDate);
GO
--Index Scan of Clustered index
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';
Same result, and SSMS has even suggested an index:
Now, as I mentioned, you can force SQL Server to use a specific index. Let's do that and see what happens:
SELECT ID,
SomeGuid
FROM dbo.SomeTable WITH (INDEX(IX_NonCoveringIndex))
WHERE SomeDate > '20220504';
And this gives exactly the plan I suggested; A key lookup:
This is expensive. In fact, if we turn on the statistics for IO and Time, the query without the index hint took 40ms, the one with the hint took 107ms in the first run. Subsequent runs all had the second query taking around double the time of the first. IO wise the first query has a simple scan and 398 logical reads; the latter had 5 scans and 114403 logical reads!
Now, finally, let's add that covering Index and run:
CREATE INDEX IX_CoveringIndex ON dbo.SomeTable (SomeDate) INCLUDE (SomeGuid);
GO
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';
Here we can see that seek we wanted:
If we look at the IO and times again compared to the prior 2, we get 1 scan, 202 logical reads, and it was running in about 25ms.
I am optimizing a query on SQL Server 2005. I have a simple query against mytable that has about 2 million rows:
SELECT id, num
FROM mytable
WHERE t_id = 587
The id field is the Primary Key (clustered index) and there exists a non-clustered index on the t_id field.
The query plan for the above query is including both a Clustered Index Seek and an Index Seek, then it's executing a Nested Loop (Inner Join) to combine the results. The STATISTICS IO is showing 3325 page reads.
If I change the query to just the following, the server is only executing 6 Page Reads and only a single Index Seek with no join:
SELECT id
FROM mytable
WHERE t_id = 587
I have tried adding an index on the num column, and an index on both num and tid. Neither index was selected by the server.
I'm looking to reduce the number of page reads but still retrieve the id and num columns.
The following index should be optimal:
CREATE INDEX idx ON MyTable (t_id)
INCLUDE (num)
I cannot remember if INCLUDEd columns were valid syntax in 2005, you may have to use:
CREATE INDEX idx ON MyTable (t_id, num)
The [id] column will be included in the index as it is the clustered key.
The optimal index would be on (t_id, num, id).
The reason your query is probably one that bad side is because multiple rows are being selected. I wonder if rephrasing the query like this would improve performance:
SELECT t.id, t.num
FROM mytable t
WHERE EXISTS (SELECT 1
FROM my_table t2
WHERE t2.t_id = 587 AND t2.id = t.id
);
Lets clarify the problem and then discuss on the solutions to improve it:
You have a table(lets call it tblTest1 and contains 2M records) with a Clustered Index on id and a Non Clustered Index on t_id, and you are going to query the data which filters the data using Non Clustered Index and getting the id and num columns.
So SQL server will use the Non Clustered Index to filter the data(t_id=587), but after filtering the data SQL server needs to get the values stored in id and num columns. Apparently because you have Clustered index then SQL server will use this index to obtain the data stored in id and num columns. This happens because leafs in the Non clustered index's tree contains the Clustered index's value, this is why you see the Key Lookup operator in the execution plan. In fact SQL Server uses the Index seek(NonCluster) to find the t_id=587 and then uses the Key Lookup to get the num data!(SQL Server will not use this operator to get the value stored in id column, because your have a Clustered index and the leafs in NonClustered Index contains the Clustered Index's value).
Referred to the above-mentioned screenshot, when we have Index Seek(NonClustred) and a Key Lookup, SQL Server needs a Nested Loop Join operator to get the data in num column using the Index Seek(Nonclustered) operator. In fact in this stage SQL Server has two separate sets: one is the results obtained from Nonclustered Index tree and the other is data inside Clustered Index tree.
Based on this story, the problem is clear! What will happen if we say to SQL server, not to do a Key Lookup? this will cause the SQL Server to execute the query using a shorter way(No need to Key Lookup and apparently no need to the Nested loop join! ).
To achieve this, we need to INCLUDE the num column inside the NonClustered index's tree, so in this case the leaf of this index will contains the id column's data and also the num column's data! Clearly when we say the SQL Server to find a data using NonClustred Index and return the id and num columns, it will not need to do a Key Lookup!
Finally what we need to do, is to INCLUDE the num in NonClustered Index! Thanks to #MJH's answer:
CREATE NONCLUSTERED INDEX idx ON tblTest1 (t_id)
INCLUDE (num)
Luckily, SQL Server 2005 provided a new feature for NonClustered indexes, the ability to include additional, non-key columns in the leaf level of the NonClustered indexes!
Read more:
https://www.red-gate.com/simple-talk/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/create-indexes-with-included-columns?view=sql-server-2017
But what will happens if we write the query like this?
SELECT id, num
FROM tblTest1 AS t1
WHERE
EXISTS (SELECT 1
FROM tblTest1 t2
WHERE t2.t_id = 587 AND t2.id = t1.id
)
This is a great approach, but lets see the execution plan:
Clearly, SQL server needs to do a Index seek(NonClustered) to find the t_id=587 and then obtain the data from Clustered Index using Clustered Index Seek. In this case we will not get any notable performance improvement.
Note: When you are using Indexes, you need to have an appropriate plan to maintain them. As the indexes get fragmented, their impact on the query performance will be decreased and you might face performance problems after a while! You need to have an appropriate plan to Reorganize and Rebuild them, when they get fragmented!
Read more: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-2017
I'm looking for some information on how Microsoft SQL Server (specifically 2016 in my case) decides on which index to use for joins. Most of the time it seems to work well, but other times I have to scratch my head...
Here is a simple query:
SELECT
OrderMaster.StartDate, Employee.EmpName
FROM
OrderMaster
INNER JOIN
Employee ON OrderMaster.EmpId = Employee.EmpId
EmpId is the PK (int) on the Employee table.
There are 2 indexes on Employee: PK_Employee, which is a clustered PK index on EmpId, and IX_Employee_EmpName which is a nonclustered index, and only uses column EmpName.
This query returns 500 order records, but has performed an Index Scan using IX_Employee_EmpName on the Employee table, which reads 30,000 records.
Why would it have picked this index, instead of doing an index seek on PK_Employee?
What is the preferred method to solve this problem? Would I specify that the join use the PK_Employee index, or should I create a new index on EmpId, but includes the additional columns I may pull from selecting (empname, address, etc)?
This query returns 500 order records, but has performed an Index Scan using IX_Employee_EmpName on the Employee table, which read 30,000 records.
SQL Server is a cost based optimizer, so it tries to choose a plan based on cost and in this case it chose this index IX_Employee_EmpName. Since empid is the clustered key, it is by default added to all nonclustered indexes and SQL Server has the two columns it needs from this narrow index
I have a log table with two columns.
DocumentType (varchar(250), not unique, not null)
DateEntered (Date, not unique, not null)
The table will only have rows inserted, never updated or deleted.
Here is the stored procedure for the report:
SELECT DocumentType,
COUNT(DocumentType) AS "CountOfDocs"
FROM DocumentTypes
WHERE DateEntered>= #StartDate AND DateEntered<= #EndDate
GROUP BY DocumentType
ORDER BY DocumentType ASC;
In the future user may want to also filter by document type in a different report. I currently have a non-clustered index containing both columns. Is this the proper index to create?
Clustered index on the date, for sure.
I think your NCI is fine. I would say both in as named columns as I assume you will have the date in the WHERE clause for your queries. I don't think 1000 per day worst case scenario will have a major impact on insert times when loading the data.
Don't add any index. It'll be heap table and wait for your "future you" with task to select something from this table :).
If you want index:
With heap: Add index on field you will filter and if the second one is only in select (=isn't in where clause) put the second one as included column. If you'll filter with both column put index on both columns.
If you want add clustered index (for example on new autoincrement primary key column) add only one index on col you want filter or try to don't add aditional index and check execution plan and efectivity - in most cases is clustered index with seeks enough.
Don't create clustered index on nonunique columns (it's used only in very special cases).
If I have created a single index on two columns [lastName] and [firstName] in that order. If I then do a query to find the number of the people with first name daniel:
SELECT count(*)
FROM people
WHERE firstName = N'daniel'
will this search in each section of the first index (lastname) and use the secondary index (firstName) to quickly search through each of the blocks of LastName entries?
This seems like an obvious thing to do and I assume that it is what happens but you know what they say about assumptions.
Yes, this query may - and probably do - use this index (and do an Index Scan) if the query optimizer thinks that it's better to "quickly search through each of the blocks of LastName entries" as you say than (do an Full Scan) of the table.
An index on (firstName) would be more efficient though for this particular query so if there is such an index, SQL-Server will use that one (and do an Index Seek).
Tested in SQL-Server 2008 R2, Express edition:
CREATE TABLE Test.dbo.people
( lastName NVARCHAR(30) NOT NULL
, firstName NVARCHAR(30) NOT NULL
) ;
INSERT INTO people
VALUES
('Johnes', 'Alex'),
... --- about 300 rows
('Johnes', 'Bill'),
('Brown', 'Bill') ;
Query without any index, Table Scan:
SELECT count(*)
FROM people
WHERE firstName = N'Bill' ;
Query with index on (lastName, firstName), Index Scan:
CREATE INDEX last_first_idx
ON people (lastName, firstName) ;
SELECT ...
Query with index on (firstName), Index Seek:
CREATE INDEX first_idx
ON people (firstName) ;
SELECT ...
If you have an index on (lastname, firstname), in this order, then a query like
WHERE firstname = 'daniel'
won't use the index, as long as you don't include the first column of the composite index (i.e. lastname) in the WHERE clause. To efficiently search for firstname only, you will need a separate index on that column.
If you frequently search on both columns, do 2 separate single column indexes. But keep in mind that each index will be updated on insert/update, so affecting performance.
Also, avoid composite indexes if they aren't covering indexes at the same time. For tips regarding composite indexes see the following article at sql-server-performance.com:
Tips on Optimizing SQL Server Composite Indexes
Update (to address downvoters):
In this specific case of SELECT Count(*) the index is a covering index (as pointed out by #ypercube in the comment), so the optimizer may choose it for execution. Using the index in this case means an Index Scan and not an Index Seek.
Doing an Index Scan means scanning every single row in the index. This will be faster, if the index contains less rows than the whole table. So, if you got a highly selective index (with many unique values) you'll get an index with roughly as many rows as the table itself. In such a case usually there won't be a big difference in doing a Clustered Index Scan (implies a PK on the table, iterates over the PK) or a Non-Clustered Index Scan (iterates over the index). A Table Scan (as seen in the screenshot of #ypercube's answer) means that there is no PK on the table, which results in an even slower execution than a Clustered Index Scan, as it doesn't have the advantage of sequential data alignment on disk given by a PK.