Log table with non-unique columns; what indexes to create - sql-server

I have a log table with two columns.
DocumentType (varchar(250), not unique, not null)
DateEntered (Date, not unique, not null)
The table will only have rows inserted, never updated or deleted.
Here is the stored procedure for the report:
SELECT DocumentType,
COUNT(DocumentType) AS "CountOfDocs"
FROM DocumentTypes
WHERE DateEntered >= @StartDate AND DateEntered <= @EndDate
GROUP BY DocumentType
ORDER BY DocumentType ASC;
In the future, users may also want to filter by document type in a different report. I currently have a non-clustered index containing both columns. Is this the proper index to create?

Clustered index on the date, for sure.
I think your NCI is fine. I would keep both as named key columns, with the date leading, since I assume you will have the date in the WHERE clause for your queries. I don't think a worst case of 1000 inserts per day will have a major impact on insert times when loading the data.
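For what it's worth, one way to write that index (a sketch only; the index name and the dbo schema are my own assumptions, not from the post):
CREATE NONCLUSTERED INDEX IX_DocumentTypes_DateEntered_DocumentType
    ON dbo.DocumentTypes (DateEntered, DocumentType);
With DateEntered as the leading key column the date-range predicate can seek, and DocumentType is then available for the GROUP BY without any lookup.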

Don't add any index yet. Leave it as a heap table and wait until your "future you" actually has to select something from it :).
If you do want an index:
With a heap: add an index on the column you will filter on, and if the second column only appears in the SELECT (i.e. isn't in the WHERE clause), put it in as an included column (see the sketch below). If you will filter on both columns, put both columns in the index key.
If you want to add a clustered index (for example on a new auto-increment primary key column), add only one index on the column you want to filter on, or try not adding any additional index at all and check the execution plan and its effectiveness - in most cases a clustered index with seeks is enough.
Don't create a clustered index on non-unique columns (that is done only in very special cases).
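A sketch of the heap variant described above, with the filter column as the key and the selected column included (the index name and schema are assumptions):
CREATE NONCLUSTERED INDEX IX_DocumentTypes_DateEntered
    ON dbo.DocumentTypes (DateEntered)
    INCLUDE (DocumentType);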

Related

Does SQL Server allow including a computed column in a non-clustered index? If not, why not?

When a column is included in a non-clustered index, SQL Server copies the values for that column from the table into the index structure (B+ tree). Included columns don't require a table lookup.
If the included column is essentially a copy of the original data, why doesn't SQL Server also allow including computed columns in the non-clustered index - applying the computation when it copies/updates the data from the table to the index structure? Or am I just not getting the syntax right here?
Assume:
DateOpened is datetime
PlanID is varchar(6)
This works:
create nonclustered index ixn_DateOpened_CustomerAccount
on dbo.CustomerAccount(DateOpened)
include(PlanID)
This does not work with left(PlanID, 3):
create nonclustered index ixn_DateOpened_CustomerAccount
on dbo.CustomerAccount(DateOpened)
include(left(PlanID, 3))
or
create nonclustered index ixn_DateOpened_CustomerAccount
on dbo.CustomerAccount(DateOpened)
include(left(PlanID, 3) as PlanType)
My use case is somewhat like below query.
select
case
when left(PlanID, 3) = '100' then 'Basic'
else 'Professional'
end as 'PlanType'
from
CustomerAccount
where
DateOpened between '2016-01-01 00:00:00.000' and '2017-01-01 00:00:00.000'
The query only cares about the left 3 characters of PlanID, and I was wondering whether, instead of computing it every time the query runs, I could include left(PlanID, 3) in the non-clustered index so the computation is done when the index is built/updated (less frequent) instead of at query time (frequent).
EDIT: We use SQL Server 2014.
As Laughing Vergil stated - you CAN index computed columns provided that they are persisted. You have a few options; here are a couple:
Option 1: Create the column as PERSISTED then index it
(or, in your case, include it in the index)
First the sample data:
CREATE TABLE dbo.CustomerAccount
(
PlanID int PRIMARY KEY,
DateOpened datetime NOT NULL,
First3 AS LEFT(PlanID,3) PERSISTED
);
INSERT dbo.CustomerAccount (PlanID, DateOpened)
VALUES (100123, '20160114'), (100999, '20151210'), (255657, '20150617');
and here's the index:
CREATE NONCLUSTERED INDEX nc_CustomerAccount ON dbo.CustomerAccount(DateOpened)
INCLUDE (First3);
Now let's test:
-- Note: IIF is available for SQL Server 2012+ and is cleaner
SELECT PlanID, PlanType = IIF(First3 = 100, 'Basic', 'Professional')
FROM dbo.CustomerAccount;
Execution Plan:
As you can see, the optimizer picked the nonclustered index.
Option #2: Perform the CASE logic inside your table DDL
First the updated table structure:
DROP TABLE dbo.CustomerAccount;
CREATE TABLE dbo.CustomerAccount
(
PlanID int PRIMARY KEY,
DateOpened datetime NOT NULL,
PlanType AS
CASE -- NOTE: casting as varchar(12) will make the column a varchar(12) column:
WHEN LEFT(PlanID,3) = 100 THEN CAST('Basic' AS varchar(12))
ELSE 'Professional'
END
PERSISTED
);
INSERT dbo.CustomerAccount (PlanID, DateOpened)
VALUES (100123, '20160114'), (100999, '20151210'), (255657, '20150617');
Notice that I use CAST to assign the data type; the table will be created with this column as varchar(12).
Now the index:
CREATE NONCLUSTERED INDEX nc_CustomerAccount ON dbo.CustomerAccount(DateOpened)
INCLUDE (PlanType);
Let's test again:
SELECT DateOpened, PlanType FROM dbo.CustomerAccount;
Execution plan:
... again, it used the nonclustered index
A third option, which I don't have time to go into, would be to create an indexed view. This would be a good option for you if you were unable to change your existing table structure.
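For reference, a rough sketch of what such an indexed view could look like for this table (untested here; the view and index names are made up, and the usual indexed-view requirements such as SCHEMABINDING and two-part names apply):
CREATE VIEW dbo.vCustomerAccount_PlanType
WITH SCHEMABINDING
AS
SELECT PlanID,
       DateOpened,
       LEFT(CAST(PlanID AS varchar(6)), 3) AS First3
FROM dbo.CustomerAccount;
GO
-- the view needs a unique clustered index before it can be used as an indexed view
CREATE UNIQUE CLUSTERED INDEX cix_vCustomerAccount_PlanType
    ON dbo.vCustomerAccount_PlanType (PlanID);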
SQL Server 2014 allows creating indexes on computed columns, but you're not doing that -- you're attempting to create the index directly on an expression. This is not allowed. You'll have to make PlanType a column first:
ALTER TABLE dbo.CustomerAccount ADD PlanType AS LEFT(PlanID, 3);
And now creating the index will work just fine (if your SET options are all correct, as outlined here):
CREATE INDEX ixn_DateOpened_CustomerAccount ON CustomerAccount(DateOpened) INCLUDE (PlanType)
It is not required that you mark the column PERSISTED. This is required only if the column is not precise, which does not apply here (this is a concern only for floating-point data).
Incidentally, the real benefit of this index is not so much that LEFT(PlanID, 3) is precalculated (the calculation is inexpensive), but that no clustered index lookup is needed to get at PlanID. With an index only on DateOpened, a query like
SELECT PlanType FROM CustomerAccount WHERE DateOpened >= '2012-01-01'
will result in an index seek on CustomerAccount, followed by a clustered index lookup to get PlanID (so we can calculate PlanType). If the index does include PlanType, the index is covering and the extra lookup disappears.
This benefit is relevant only if the index is truly covering, however. If you select other columns from the table, an index lookup is still required and the included computed column is only taking up space for little gain. Likewise, suppose that you had multiple calculations on PlanID or you needed PlanID itself as well -- in this case it would make much more sense to include PlanID directly rather than PlanType.
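As a sketch of that last point, reusing the index name from the question, including the base column instead of the derived one would look like:
CREATE NONCLUSTERED INDEX ixn_DateOpened_CustomerAccount
    ON dbo.CustomerAccount (DateOpened)
    INCLUDE (PlanID);
Any expression on PlanID (LEFT, a CASE, and so on) can then be computed from the covering index without touching the clustered index.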
Computed columns are only allowed in indexes if they are Persisted - that is, if the data is written to the table. If the information is not persisted, then the information isn't even calculated / available until the field is queried.

Added an Index to a field and it's still running slow

We have 10M records in a table in a SQL Server 2012 database and we want to retrieve the top 2000 records based on a condition.
Here's the SQL statement:
SELECT TOP 2000 *
FROM Users
WHERE LastName = 'Stokes'
ORDER BY LastName
I have added a non-clustered index on the column LastName and it takes 9 seconds to retrieve the 2000 records. I tried creating an indexed view with an index on the same column, but to no avail; it takes about the same time. Is there anything else I can do to improve the performance?
Using select * will cause key lookups for all the rows that match your criteria (= for each matching value of the clustered key, the database has to travel through the clustered index to the leaf level to find the rest of the values).
You can see that in the actual execution plan, and you can also check that the index you created is actually being used (= an index seek on that index). If the key lookup is the reason for the slowness, the select will be fast if you run just select LastName from ....
If there are actually only a few columns you need from the table (or there aren't that many columns in the table), you can add those columns as included columns in your index and that should speed it up. Always specify the fields you need in the select instead of just using select *.
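For example, something along these lines (a sketch only; FirstName and Email are placeholders for whatever columns the query actually needs):
CREATE NONCLUSTERED INDEX IX_Users_LastName_Covering
    ON dbo.Users (LastName)
    INCLUDE (FirstName, Email);

SELECT TOP 2000 LastName, FirstName, Email
FROM dbo.Users
WHERE LastName = 'Stokes'
ORDER BY LastName;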

Non-Clustered Index reference on the composite column

I have a non-clustered index on a datetime column (AdmitDate) and a varchar column (Status) in SQL Server. Now the issue is that I'm filtering the result only on the basis of the datetime column (no Index on AdmitDate column alone).
In order to utilize the non-clustered index, I used a NOT NULL condition for the varchar column (Status), but in that scenario the execution plan shows an "Index Scan".
select ClientName, ID
from PatientVisit
where
(PatientVisit.AdmitDate between '2010-01-01 00:00:00.000' AND '2014-01-31 00:00:00.000' )
AND PatientVisit.Status is not null
-- Index Scan
But if I pass a specific Status value then, as expected, the execution plan shows an Index Seek.
select ClientName, ID
from PatientVisit
where
(PatientVisit.AdmitDate between '2010-01-01 00:00:00.000' AND '2014-01-31 00:00:00.000')
AND PatientVisit.Status = 'ADM'
--Index Seek
Should I use the IN operator and pass all the possible values for the Status column to utilize the non-clustered index?
Or is there any other way to utilize the index?
You're using SELECT ClientName, ID, and because you fetch columns that are not part of the index, SQL Server will need to go to the actual data page to get those column values.
So if SQL Server finds a match in the non-clustered index, it will have to do an (expensive) key lookup into the clustered index to fetch the data page, which contains all columns.
If too many rows have a Status that is not NULL (i.e. too many rows match your predicate), SQL Server will come to the conclusion that it's faster to just scan the whole index rather than do a great many index seeks and key lookups. In the other case, when you specify one value that matches only a few rows (or just one), it might be faster to actually do the index seek and one expensive key lookup.
One thing you could try is an index which includes the two columns that you need for your SELECT:
CREATE NONCLUSTERED INDEX IX_PatientVisit_DateStatusIncluded
ON dbo.PatientVisit(AdmitDate, Status)
INCLUDE(ClientName, ID)
Now in this case, SQL Server could find the values it needs to satisfy this query in the index leaf page, so it will be a lot more likely to actually use that index - even if it finds a lot of hits - possibly with an Index Scan on that small index (which isn't bad, either!)
Create a filtered index. You can then create an index on the datetime field covering only the rows where Status is not null.
CREATE NONCLUSTERED INDEX FI_IX_AdmitDate_StatusNotNull
ON dbo.PatientVisit(AdmitDate)
WHERE Status IS NOT NULL
This will be used for your query where Status IS NOT NULL, and your existing index will be used for queries where Status = 'ASpecificValue'.

Why does SQL Server use an Index Scan instead of a Seek + RID lookup?

I have a table with approx. 135M rows:
CREATE TABLE [LargeTable]
(
[ID] UNIQUEIDENTIFIER NOT NULL,
[ChildID] UNIQUEIDENTIFIER NOT NULL,
[ChildType] INT NOT NULL
)
It has a non-clustered index with no included columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX]
ON [LargeTable]
(
[ChildID] ASC
)
(It is clustered on ID).
I wish to join this against a temporary table which contains a few thousand rows:
CREATE TABLE #temp
(
ChildID UNIQUEIDENTIFIER PRIMARY KEY,
ChildType INT
)
...add #temp data...
SELECT lt.ChildID, lt.ChildType
FROM #temp t
INNER JOIN [LargeTable] lt
ON lt.[ChildID] = t.[ChildID]
However the query plan includes an index scan on the large table:
If I change the index to include extra columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX] ON [LargeTable]
(
[ChildID] ASC
)
INCLUDE ([ChildType])
Then the query plan changes to something more sensible:
So my question is: Why can't SQL Server still use an index seek in the first scenario, but with a RID lookup to get from the non-clustered index to the table data? Surely that would be more efficient than an index scan on such a large table?
The first query plan actually makes a lot of sense. Remember that SQL Server never reads individual records; it reads pages. In your table, a page contains many records, since those records are so small.
With the original index, if the second query plan were used, then after finding all the RIDs in the index (and reading index pages to do so), pages in the clustered index would need to be read to fetch the ChildType column. In the worst case that is an entire page read for each record. As there are many records per page, that might boil down to reading a large percentage of the pages in the clustered index.
SQL Server estimated, based on statistics, that simply scanning the pages in the clustered index would require fewer page reads in total, because it then avoids reading the pages in the non-clustered index as well.
What matters here is the number of rows in the temp table compared to the number of pages in the large table. Assuming a random distribution of ChildID in the large table, as soon as the number of rows in the temp table approaches or exceeds the number of pages in the large table, SQL Server will have to read virtually every page in the large table anyway.
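If you want to sanity-check that trade-off yourself, one rough way is to compare the two numbers directly (a sketch using standard DMVs; it assumes the table lives in the dbo schema):
-- rows the join has to probe for
SELECT COUNT(*) AS TempRows
FROM #temp;

-- data pages in the large table (heap or clustered index)
SELECT SUM(in_row_data_page_count) AS LargeTablePages
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.LargeTable')
  AND index_id IN (0, 1);
Once TempRows gets into the same ballpark as LargeTablePages, a scan of the large table is usually the cheaper plan.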
Because the column ChildType isn't covered in an index, it has to go back to the clustered index (with the mentioned Row IDentifier lookup) to get the values for ChildType.
When you INCLUDE this column in the nonclustered index it will be added to the leaf-level of the index where it is available for querying.
Colloquially this is called 'the index tipping point': basically, the point at which the cost-based optimizer considers it more effective to do a scan rather than a seek + lookup. Usually it is around 20% of the table size, which in your case will be based on an estimate coming from the #temp table stats. YMMV.
You already have your answer: include the required column, make the index covering.

sql server multi column index queries

Say I have created a single index on two columns, [lastName] and [firstName], in that order. If I then do a query to find the number of people with the first name daniel:
SELECT count(*)
FROM people
WHERE firstName = N'daniel'
will this search in each section of the first index (lastname) and use the secondary index (firstName) to quickly search through each of the blocks of LastName entries?
This seems like an obvious thing to do and I assume that it is what happens but you know what they say about assumptions.
Yes, this query may - and probably will - use this index (doing an Index Scan) if the query optimizer thinks that it's better to "quickly search through each of the blocks of LastName entries", as you say, than to do a Full Scan of the table.
An index on (firstName) would be more efficient for this particular query though, so if there is such an index, SQL Server will use that one (and do an Index Seek).
Tested in SQL Server 2008 R2, Express edition:
CREATE TABLE Test.dbo.people
( lastName NVARCHAR(30) NOT NULL
, firstName NVARCHAR(30) NOT NULL
) ;
INSERT INTO people
VALUES
('Johnes', 'Alex'),
... --- about 300 rows
('Johnes', 'Bill'),
('Brown', 'Bill') ;
Query without any index, Table Scan:
SELECT count(*)
FROM people
WHERE firstName = N'Bill' ;
Query with index on (lastName, firstName), Index Scan:
CREATE INDEX last_first_idx
ON people (lastName, firstName) ;
SELECT ...
Query with index on (firstName), Index Seek:
CREATE INDEX first_idx
ON people (firstName) ;
SELECT ...
If you have an index on (lastname, firstname), in this order, then a query like
WHERE firstname = 'daniel'
won't use the index, as long as you don't include the first column of the composite index (i.e. lastname) in the WHERE clause. To efficiently search for firstname only, you will need a separate index on that column.
If you frequently search on both columns separately, create 2 separate single-column indexes. But keep in mind that each index has to be maintained on insert/update, which affects performance.
Also, avoid composite indexes if they aren't covering indexes at the same time. For tips regarding composite indexes see the following article at sql-server-performance.com:
Tips on Optimizing SQL Server Composite Indexes
Update (to address downvoters):
In this specific case of SELECT Count(*), the index is a covering index (as pointed out by @ypercube in the comments), so the optimizer may choose it for execution. Using the index in this case means an Index Scan and not an Index Seek.
Doing an Index Scan means scanning every single row in the index. This will be faster if the index is narrower (fewer pages to read) than the whole table. Note that even a highly selective index (with many unique values) still contains roughly as many rows as the table itself. In such a case there usually won't be a big difference between a Clustered Index Scan (which implies a clustered index on the table and iterates over it) and a Non-Clustered Index Scan (which iterates over the non-clustered index). A Table Scan (as seen in the screenshot in @ypercube's answer) means that there is no clustered index on the table, which results in even slower execution than a Clustered Index Scan, as it doesn't have the advantage of sequential data alignment on disk given by the clustering key.
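If you want to see which plan you actually get, a quick check (just a sketch) is to compare logical reads with and without the single-column index:
SET STATISTICS IO ON;

SELECT count(*)
FROM people
WHERE firstName = N'daniel';

-- compare the logical reads reported in the Messages tab before and after
-- creating first_idx, and look at the actual execution plan
SET STATISTICS IO OFF;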
