SQL Server: Indexing date column in a log table

Example table:
CREATE TABLE Log (
    logID int IDENTITY,
    logDate datetime,
    logText varchar(42)
)
logID is already indexed because it is the primary key, but if you were to query this table you would most likely want to filter on logDate. However, logID and logDate will be in the same order, because logDate is always set to GETDATE().
Does it make sense to add an extra non-clustered index on logDate, taking into account that fast writes are important for the Log table?

Make a clustered index on logDate, logID (in that order).
Since datetime values are ever-increasing here, this should not cost anything extra. Adding logID to the key saves you from collisions when two log entries are inserted at the same time (which could happen).
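A minimal sketch of that layout, extending the table definition from the question (the constraint and index names are mine):
CREATE TABLE Log (
    logID int IDENTITY NOT NULL,
    logDate datetime NOT NULL DEFAULT GETDATE(),
    logText varchar(42),
    CONSTRAINT PK_Log PRIMARY KEY NONCLUSTERED (logID)
)
CREATE UNIQUE CLUSTERED INDEX IX_Log_DateID ON Log (logDate, logID)
Keeping the primary key NONCLUSTERED frees the clustered slot for the (logDate, logID) key.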

If you'll have lots of queries with a
WHERE LogDate > ......
(or something similar) clause - yes, by all means!
The clustered index on the LogID will not help if you select by date - so if that's a common operation, a non-clustered index on the date will definitely help a lot.
Marc

Arthur is spot on in suggesting the clustered index on logdate, logid, since you will potentially run into duplicates within the window of the datetime accuracy (3.33ms on SQL 2005).
I would also consider in advance whether the table is going to be large enough to warrant table partitioning, which lets the log keep a sliding window and old data be removed without long-running delete statements. That all depends on the quantity of log records and whether you have Enterprise Edition.
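For illustration, a sliding-window setup might look like the sketch below; the boundary dates and object names are assumptions, and partitioning historically required Enterprise Edition:
CREATE PARTITION FUNCTION pfLogDate (datetime)
    AS RANGE RIGHT FOR VALUES ('20240101', '20240201', '20240301')
CREATE PARTITION SCHEME psLogDate
    AS PARTITION pfLogDate ALL TO ([PRIMARY])
-- Rebuild the clustered index on the scheme so old months can be
-- switched out and truncated instead of deleted row by row
CREATE UNIQUE CLUSTERED INDEX IX_Log_DateID
    ON Log (logDate, logID)
    WITH (DROP_EXISTING = ON)
    ON psLogDate (logDate)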

Related

SQL Server non-clustered index is not being used

An application (the underlying code of which can't be modified) is running a select statement against a relatively large table (~1.8M rows, 2GB) and creating a huge performance bottleneck on the DB server.
The table itself has approx. 120 columns of varying datatype. The select statement is selecting about 100 of those columns based on values of 2 columns which have both been indexed individually and together.
e.g.
SELECT
column1,
column2,
column3,
column4,
column5,
and so on.....
FROM
ITINDETAIL
WHERE
(column23 = 1 AND column96 = 1463522)
However, SQL Server chooses to ignore the index and instead scans the clustered (PK) index, which takes forever (this was cancelled after 42 seconds; it's been known to take over 8 minutes on a busy production DB).
If I simply change the select statement, replacing all columns with just a select count(*), the index is used and results return in milliseconds.
EDIT: I believe ITINDETAIL_004 (where column23 and column96 have been indexed together) is the index that should be getting used for the original statement.
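For reference, based on that description, ITINDETAIL_004 would presumably look something like this (an assumption, since its actual definition isn't shown):
CREATE NONCLUSTERED INDEX [ITINDETAIL_004]
ON [ITINDETAIL] ([column23] ASC, [column96] ASC)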
Confusingly, if I create a non-clustered index on the table for the two columns in the where clause like this:
CREATE NONCLUSTERED INDEX [ITINDETAIL20220504]
ON [ITINDETAIL] ([column23] ASC, [column96] ASC)
INCLUDE (column1, column2, column3, column4, column5,
and so on..... )
and include ALL other columns from the select statement in the index (in my mind, this is essentially just creating a TABLE!), and then run the original select statement, the new HUGE index is used and results are returned very quickly:
However, I'm not sure this is the right way to address the problem.
How can I force SQL Server to use what I think is the correct index?
Version is: SQL Server 2017
Based on additional details from the comments, it appears that the index you want SQL Server to use isn't a covering index; this means that the index doesn't contain all the columns that are referenced in the query. As such, if SQL Server were to use said index, then it would need to first do a seek on the index, and then perform a key lookup on the clustered index to get the full details of the row. Such lookups can be expensive.
As a result of the index you want not being covering, SQL Server has determined that the index you want it to use would produce an inferior query plan to simply scanning the entire clustered index, which is by definition covering since it contains every column in the table.
For your index ITINDETAIL20220504 you have INCLUDEd all the columns that are in your SELECT, which means that it is covering. This means that SQL Server can perform a seek on the index and get all the information it needs from that seek, which is far less costly than a seek followed by a key lookup, and quicker than a scan of the entire clustered index. This is why this index works.
We could put this into some kind of analogy, using a library full of books, to help explain the idea:
Let's say that the clustered index is a list of every book in the library, sorted by its ISBN (the primary key). Alongside that ISBN you have the details of the author, title, publication date, publisher, whether it's hardcover or softcover, the colour of the spine, the section of the library the book is located in, the bookcase, and the shelf.
Now let's say you want to obtain any books by the author Brandon Sanderson published on or after 2015-01-01. You could go through the entire list, one by one, finding the books by that author, checking the publication date, and then writing down each book's location so you can go and visit each of those locations and collect the book. This is effectively a Clustered Index Scan.
Now let's say you have another list of all the books in the library. This list contains the author, publication date, and ISBN (the primary key), and is ordered by author and publication date. You want to fulfil the same task: obtain any books by the author Brandon Sanderson published on or after 2015-01-01. Now you can easily go through that list and find all those books, but you don't know where they are. As a result, even after you have gone straight to the Brandon Sanderson "section" of the list, you'll still need to write all the ISBNs down and then find each of those ISBNs in the original list to get their location and title. This is your index ITINDETAIL_004: you can easily find the rows you want to filter to, but you don't have all the information, so you have to go somewhere else afterwards.
Lastly we have a third list. This list is ordered by author and then publication date (like the second list), but it also includes the title, the section of the library the book is located in, the bookcase, and the shelf, as well as the ISBN (the primary key). This list is ideal for your task: it's in the right order, as you can easily go to Brandon Sanderson and then the first book published on or after 2015-01-01, and it has the title and location of each book. This is what your index ITINDETAIL20220504 would be; it has the information in the order you want, and contains all the information you asked for.
Saying all this, you can force SQL Server to choose the index, but as I said in my comment:
In truth, rarely is the index you think is correct actually the correct index if SQL Server thinks otherwise. Despite what some believe, SQL Server is very good at making informed choices about what index(es) it should be using, provided your statistics are up to date and the cached plan isn't based on old and out-of-date information.
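If stale statistics are the concern, refreshing them is a one-liner (standard T-SQL; WITH FULLSCAN is optional):
UPDATE STATISTICS dbo.ITINDETAIL WITH FULLSCAN;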
Let's, however, demonstrate what happens if you do, with a simple setup:
CREATE TABLE dbo.SomeTable (ID int IDENTITY(1,1) PRIMARY KEY,
                            SomeGuid uniqueidentifier DEFAULT NEWID(),
                            SomeDate date);
GO
-- dbo.Tally is a user-defined number-generating (tally) function
INSERT INTO dbo.SomeTable (SomeDate)
SELECT DATEADD(DAY, T.I, '19000101')
FROM dbo.Tally(100000,0) T;
Now we have a table with 100001 rows, with a Primary Key on ID. Now let's do a query which is an overly simplified version of yours:
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';
No surprise, this results in a Clustered Index Scan:
Ok, let's add an index on SomeDate and run the query again:
CREATE INDEX IX_NonCoveringIndex ON dbo.SomeTable (SomeDate);
GO
--Index Scan of Clustered index
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';
Same result, and SSMS has even suggested an index:
Now, as I mentioned, you can force SQL Server to use a specific index. Let's do that and see what happens:
SELECT ID,
SomeGuid
FROM dbo.SomeTable WITH (INDEX(IX_NonCoveringIndex))
WHERE SomeDate > '20220504';
And this gives exactly the plan I suggested, a key lookup:
This is expensive. In fact, if we turn on the statistics for IO and time, the query without the index hint took 40ms and the one with the hint took 107ms on the first run; subsequent runs all had the second query taking around double the time of the first. IO-wise, the first query did a simple scan with 398 logical reads; the latter did 5 scans and 114,403 logical reads!
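If you want to reproduce those measurements, the session settings are standard T-SQL:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run both queries, then compare the figures in the Messages tab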
Now, finally, let's add that covering Index and run:
CREATE INDEX IX_CoveringIndex ON dbo.SomeTable (SomeDate) INCLUDE (SomeGuid);
GO
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';
Here we can see the seek we wanted:
Looking at the IO and times again, compared to the prior two queries we get 1 scan, 202 logical reads, and a running time of about 25ms.

SQL Server doesn't use a suggested index

First, I am using: Microsoft SQL Server 2012 (SP1) - 11.0.3000.0 (X64)
I have created a table that looks like this:
create table dbo.pos_key
( keyid int identity(1,1) not null
, systemid int not null
, partyid int not null
, portfolioid int null
, instrumentid int not null
, security_no decimal(10,0) null
, entry_date datetime not null
)
keyid is a clustered primary key. My table has about 144,000 rows. Currently systemId doesn't have much fluctuation; it is the same in every row except one.
Now I perform the following query:
select *
from pos_key
where systemid = 33000
and portfolioid = 150444
and instrumentid = 639
This returns 1 row after a clustered index scan on [pos_key].[PK_pos_key].
The execution plan said that the expected row count was 1.082.
SQL Server quickly suggests that I add an index.
CREATE NONCLUSTERED INDEX IDX_SYS_PORT_INST
ON [dbo].[pos_key] ([systemid],[portfolioid],[instrumentid])
So I do and run the query again.
Surprisingly, SQL Server doesn't use the new index; instead it again goes for the same clustered index scan, but now it claims to expect 4087 rows! It doesn't suggest any new index this time, however.
To get it to use the new index I have done the following:
Updated table statistics (update statistics)
Updated index statistics (update statistics)
Dropped cached execution plans related to this query (DBCC FREEPROCCACHE)
No luck; SQL Server always goes for the clustered scan and expects 4087 rows.
Index statistics look like this:
All Density   Average Length  Columns
------------  --------------  ------------------------------------------
0.5           4               systemid
6.095331E-05  7.446431        systemid, portfolioid
1.862301E-05  11.44643        systemid, portfolioid, instrumentid
6.9314E-06    15.44643        systemid, portfolioid, instrumentid, keyid
Curiously, I left this overnight, ran the query again in the morning, and BAMM: now it hits the index. I dropped the index, ran the select, and then created the index again. Now SQL Server is back to 4087 expected rows and a clustered index scan.
So what am I missing? The index obviously works, but SQL Server doesn't want to use it right away.
Is the lack of fluctuation in systemId somehow causing trouble?
Is DBCC FREEPROCCACHE not enough to get rid of cached execution plans?
Are the ways of SQL Server just mysterious?
With a composite index where all columns are used in equality predicates, specify the most selective column first (portfolioid here). SQL Server maintains a histogram only for the first column of an index.
With the less selective column first, SQL Server probably overestimated the row count and chose the clustered index scan instead, thinking it was more efficient since you are selecting all columns.
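A sketch of the reordered index described above (assuming portfolioid really is the most selective of the three columns):
CREATE NONCLUSTERED INDEX IDX_PORT_INST_SYS
ON [dbo].[pos_key] ([portfolioid], [instrumentid], [systemid])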

SQL Server 2014 Index Optimization: Any benefit with including primary key in indexes?

After running a query, the SQL Server 2014 Actual Query Plan shows a missing index like below:
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1) INCLUDE
(PK_Column,SomeOtherColumn)
The missing index suggests including the Primary Key column in the index. The table's clustered index is on PK_Column.
I am confused and it seems that I don’t get the concept of Clustered Index Primary Key right.
My assumption was: when a table has a clustered PK, all of the non-clustered indexes point to the PK value. Am I correct? If I am, why does the missing-index suggestion ask me to include the PK column in the index?
Summary:
The advised index is not valid, but it doesn't make any difference. See the tests section below for details.
After researching for some time, I found an answer here, and the statement below explains the missing-index feature convincingly:
they only look at a single query, or a single operation within a single query. They don't take into account what already exists or your other query patterns.
You still need a thinking human being to analyze the overall indexing strategy and make sure that your index structure is efficient and cohesive.
So coming to your question: the advised index may be valid, but it should not be taken for granted. The advised index is useful to SQL Server for the particular query executed, to reduce its cost.
This is the index that was advised..
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1)
INCLUDE (PK_Column, SomeOtherColumn)
Assume you have a query like below..
select pk_column, someothercolumn
from table
where column1 = 'somevalue'
SQL Server tries to scan a narrow index as well if available, so in this case an index as advised will be helpful..
Further, you didn't share the table schema; if you have an index like the one below
create index nci_test on table(column1)
then a query of the below form will again result in the same advised index as stated in the question:
select pk_column, someothercolumn
from table
where column1 = 'somevalue'
Update:
I have an orders table with the below schema:
[orderid] [int] NOT NULL Primary key,
[custid] [char](11) NOT NULL,
[empid] [int] NOT NULL,
[shipperid] [varchar](5) NOT NULL,
[orderdate] [date] NOT NULL,
[filler] [char](160) NOT NULL
Now I created one more index with the below structure:
create index onlyempid on orderstest(empid)
Now when I have a query of the below form,
select empid,orderid,orderdate --6.3 units
from orderstest
where empid=5
the index advisor will advise the below missing index:
CREATE NONCLUSTERED INDEX empidalongwithorderiddate
ON [dbo].[orderstest] ([empid])
INCLUDE ([orderid],[orderdate]) -- you can drop orderid too, it doesn't make any difference
As you can see, orderid is also included in the above suggestion.
Now let's create it and observe both structures at the root and leaf levels.
(Screenshots of the root-level and leaf-level pages for both indexes, onlyempid and empidalongwithorderiddate, omitted.)
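If you want to inspect the page structures yourself, the well-known (though undocumented) DBCC commands below will do it; the database name and page id are placeholders:
DBCC IND ('MyDb', 'orderstest', -1)   -- lists the pages belonging to each index of the table
DBCC TRACEON (3604)                   -- route DBCC PAGE output to the client
DBCC PAGE ('MyDb', 1, 12345, 3)       -- dump one page (file and page ids come from DBCC IND)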
As you can see, creating the index as per the suggestion makes no difference, even though the suggestion is invalid.
I assume the suggestion was made by the index advisor based on the query that was run; it is specific to that query and has no knowledge of the other indexes involved.
I don't know your schema, nor your queries. Just guessing.
Please correct me if this theory is incorrect.
You are right that non-clustered indexes point to the PK value. Imagine you have a large database (gigabytes of files, say) stored on an ordinary platter hard drive. Let's suppose the disk is fragmented and the PK index is stored physically far from your Table1 index.
Imagine that your query needs to evaluate Column1 and PK_column as well. The query execution reads a Column1 value, then a PK value, then a Column1 value, then a PK value...
The hard-drive platter spins from one physical place to another, and this can take time.
Having all you need in one index is more effective, because it means reading one file sequentially.

SQL Server: Clustering by timestamp; pros/cons

I have a table in SQL Server where I want inserts to be added to the end of the table (as opposed to a clustering key that would cause them to be inserted in the middle). This means I want the table clustered by some column that will constantly increase.
This could be achieved by clustering on a datetime column:
CREATE TABLE Things (
...
CreatedDate datetime DEFAULT getdate(),
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (CreatedDate)
)
But I can't guarantee that two Things won't have the same time, so my requirements can't really be achieved by a datetime column.
I could add a dummy identity int column, and cluster on that:
CREATE TABLE Things (
...
RowID int IDENTITY(1,1),
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (RowID)
)
But you'll notice that my table already contains a timestamp column, a column which is guaranteed to be monotonically increasing. This is exactly the characteristic I want in a candidate cluster key.
So I cluster the table on the rowversion (aka timestamp) column:
CREATE TABLE Things (
...
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (timestamp)
)
Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
What I'm looking for are thoughts of why this is a bad idea; and what other ideas are better.
Note: Community wiki, since the answers are subjective.
So I cluster the table on the rowversion (aka timestamp) column: rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
That might sound like a good idea at first - but it's really almost the worst option you have. Why?
The main requirements for a clustered key are (see Kim Tripp's blog post for more excellent details):
stable
narrow
unique
ever-increasing if possible
Your rowversion violates the stable requirement, and that's probably the most important one. The rowversion of a row changes with each modification to the row - and since your clustering key is being added to each and every non-clustered index in the table, your server will be constantly updating loads of non-clustered indices and wasting a lot of time doing so.
In the end, adding a dummy identity column probably is a much better alternative for your case. The second-best choice would be the datetime column - but there you run the risk of SQL Server having to add "uniquifiers" to your entries when duplicates occur, and with 3.33ms accuracy this could definitely happen. Not optimal, but still much better than the rowversion idea...
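A middle ground, echoing the accepted advice in the first question above, is to cluster on (CreatedDate, RowID) so the key stays unique without uniquifiers; a sketch using the column names from the question:
CREATE TABLE Things (
    RowID int IDENTITY(1,1) NOT NULL,
    CreatedDate datetime NOT NULL DEFAULT GETDATE(),
    [timestamp] timestamp,
    CONSTRAINT IX_Things UNIQUE CLUSTERED (CreatedDate, RowID)
)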
From the timestamp documentation linked in the question:
The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
and
Duplicate rowversion values can be generated by using the SELECT INTO statement in which a rowversion column is in the SELECT list. We do not recommend using rowversion in this manner.
So why on earth would you want to cluster by either, especially since their values always change when the row is updated? Just use an identity as the PK and cluster on it.
You were on the right track already. You can use a DateTime column that holds the created date and create a clustered but non-unique index:
CREATE TABLE Things (
    ...
    CreatedDate datetime DEFAULT getdate(),
    [timestamp] timestamp
)
CREATE CLUSTERED INDEX [IX_CreatedDate] ON [dbo].[Things]
(
    [CreatedDate] ASC
)
If this table gets a lot of inserts, you might be creating a hot spot that interferes with updates, because all of the inserts will be happening on the same physical/index pages. Check your locking setup.

Big Table Advice (SQL Server)

I'm experiencing massive slowness when accessing one of my tables and I need some re-factoring advice. Sorry if this is not the correct area for this sort of thing.
I'm working on a project that aims to report on server performance statistics for our internal servers. I'm processing windows performance logs every night (12 servers, 10 performance counters and logging every 15 seconds). I'm storing the data in a table as follows:
CREATE TABLE [dbo].[log](
[id] [int] IDENTITY(1,1) NOT NULL,
[logfile_id] [int] NOT NULL,
[test_id] [int] NOT NULL,
[timestamp] [datetime] NOT NULL,
[value] [float] NOT NULL,
CONSTRAINT [PK_log] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH FILLFACTOR = 90 ON [PRIMARY]
) ON [PRIMARY]
There's currently 16,529,131 rows and it will keep on growing.
I access the data to produce reports and create graphs from coldfusion like so:
SET NOCOUNT ON
CREATE TABLE ##RowNumber ( RowNumber int IDENTITY (1, 1), log_id char(9) )
INSERT ##RowNumber (log_id)
SELECT l.id
FROM log l, logfile lf
WHERE lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#"
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
order by l.timestamp asc
select rn.RowNumber, l.value, l.timestamp
from log l, logfile lf, ##RowNumber rn
where lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#
and l.logfile_id = lf.id
and rn.log_id = l.id
and ((rn.rownumber % #modu# = 0) or (rn.rownumber = 1))
order by l.timestamp asc
DROP TABLE ##RowNumber
SET NOCOUNT OFF
(for non-CF devs: #value# interpolates a value, and ## escapes to a literal #)
I basically create a temporary table so that I can use the row number to select every x rows. This way I'm only selecting the number of rows I can display. This helps, but it's still very slow.
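As an aside: on SQL Server 2005 and later, ROW_NUMBER() can do the every-x-rows sampling without the temp table. A sketch, with the ColdFusion parameters replaced by placeholder T-SQL variables:
DECLARE @server_id int, @test_id int, @modu int
DECLARE @report_from datetime, @report_to datetime
SELECT @server_id = 1, @test_id = 1, @modu = 10,
       @report_from = '20090101', @report_to = '20090201'

SELECT RowNumber, value, [timestamp]
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY l.[timestamp] ASC) AS RowNumber,
           l.value, l.[timestamp]
    FROM log l
    JOIN logfile lf ON l.logfile_id = lf.id
    WHERE lf.server_id = @server_id
      AND l.test_id = @test_id
      AND l.[timestamp] >= @report_from
      AND l.[timestamp] < @report_to
) t
WHERE t.RowNumber % @modu = 0 OR t.RowNumber = 1
ORDER BY t.[timestamp] ASC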
SQL Server Management Studio tells me my indexes are as follows (I have pretty much no knowledge about using indexes properly):
IX_logfile_id (Non-Unique, Non-Clustered)
IX_test_id (Non-Unique, Non-Clustered)
IX_timestamp (Non-Unique, Non-Clustered)
PK_log (Clustered)
I would be very grateful to anyone who could give some advice that could help me speed things up a bit. I don't mind re-organising things and I have complete control of the project (perhaps not over the server hardware though).
Cheers (sorry for the long post)
Your problem is that you chose a bad clustered key. Nobody is ever interested in retrieving one particular log value by ID. If your system is like anything else I've seen, then all queries are going to ask for:
all counters for all servers over a range of dates
specific counter values over all servers for a range of dates
all counters for one server over a range of dates
specific counter for specific server over a range of dates
Given the size of the table, all your non-clustered indexes are useless. They are all going to hit the index tipping point, guaranteed, so they might just as well not exist. I assume all your non-clustered indexes are defined as a simple index over the field in the name, with no included fields.
I'm going to pretend I actually know your requirements. You must forget common sense about storage and actually duplicate all your data in every non-clustered index. Here is my advice (a DDL sketch follows this list):
Drop the clustered index on [id]; it is as useless as it gets.
Organize the table with a clustered index on (logfile_id, test_id, timestamp).
Non-clustered index on (test_id, logfile_id, timestamp) include (value)
NC index on (logfile_id, timestamp) include (value)
NC index on (test_id, timestamp) include (value)
NC index on (timestamp) include (value)
Add maintenance tasks to reorganize all indexes periodically, as they are prone to fragmentation.
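A DDL sketch of that layout, using the columns from the DDL above (the index names are mine; INCLUDE requires SQL Server 2005 or later):
-- Replace the clustered PK with a nonclustered one, then re-cluster:
ALTER TABLE dbo.log DROP CONSTRAINT PK_log
ALTER TABLE dbo.log ADD CONSTRAINT PK_log PRIMARY KEY NONCLUSTERED (id)
CREATE CLUSTERED INDEX IX_log_cluster
    ON dbo.log (logfile_id, test_id, [timestamp])
CREATE NONCLUSTERED INDEX IX_log_test_file ON dbo.log (test_id, logfile_id, [timestamp]) INCLUDE (value)
CREATE NONCLUSTERED INDEX IX_log_file_time ON dbo.log (logfile_id, [timestamp]) INCLUDE (value)
CREATE NONCLUSTERED INDEX IX_log_test_time ON dbo.log (test_id, [timestamp]) INCLUDE (value)
CREATE NONCLUSTERED INDEX IX_log_time ON dbo.log ([timestamp]) INCLUDE (value)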
The clustered index covers the query 'history of a specific counter value on a specific machine'. The non-clustered indexes cover the various other possible queries (all counters on a machine over time, a specific counter across all machines over time, etc.).
You notice I did not comment anything about your query script. That is because there isn't anything in the world you can do to make the queries run faster over the table structure you have.
Now one thing you shouldn't do is actually implement my advice as-is. I said I'm going to pretend I know your requirements, but I actually don't; I just gave an example of a possible structure. What you really should do is study the topic and figure out the correct index structure for your requirements:
General Index Design Guidelines.
Index Design Basics
Index with Included Columns
Query Types and Indexes
Also a google on 'covering index' will bring up a lot of good articles.
And of course, at the end of the day storage is not free, so you'll have to balance the requirement to have a non-clustered index on every possible combination against the need to keep the size of the database in check. Luckily you have a very small and narrow table, so duplicating it over many non-clustered indexes is no big deal. Also, I wouldn't be concerned about insert performance: 120 counters sampled every 15 seconds means 8-9 inserts per second, which is nothing.
A couple things come to mind.
Do you need to keep that much data? If not, consider creating an archive table if you want to keep the old rows (but don't create it just to join it with the primary table every time you run a query).
I would avoid using a temp table with so much data. See this article on temp table performance and how to avoid using them.
http://www.sql-server-performance.com/articles/per/derived_temp_tables_p1.aspx
It looks like you are missing an index on the server_id field. I would consider creating a covered index using this field and others. Here is an article on that as well.
http://www.sql-server-performance.com/tips/covering_indexes_p1.aspx
Edit
With that many rows in the table over such a short time frame, I would also check the indexes for fragmentation which may be a cause for slowness. In SQL Server 2000 you can use the DBCC SHOWCONTIG command.
See this link for info http://technet.microsoft.com/en-us/library/cc966523.aspx
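For example, to check every index on the log table (SQL Server 2000 syntax):
DBCC SHOWCONTIG ('log') WITH ALL_INDEXES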
Once, when still working with SQL Server 2000, I needed to do some paging, and I came across a method of paging that really blew my mind. Have a look at this method.
DECLARE @Table TABLE(
    TimeVal DATETIME
)
DECLARE @StartVal INT
DECLARE @EndVal INT
SELECT @StartVal = 51, @EndVal = 100
SELECT *
FROM (
    SELECT TOP (@EndVal - @StartVal + 1)
        *
    FROM (
        --select up to the end number
        SELECT TOP (@EndVal)
            *
        FROM @Table
        ORDER BY TimeVal ASC
    ) PageReversed
    ORDER BY TimeVal DESC
) PageVals
ORDER BY TimeVal ASC
As an example
SELECT *
FROM (
    SELECT TOP (@EndVal - @StartVal + 1)
        *
    FROM (
        SELECT TOP (@EndVal)
            l.id,
            l.timestamp
        FROM log l, logfile lf
        WHERE lf.server_id = #arguments.server_id#
            and l.test_id = #arguments.test_id#
            and l.timestamp >= #arguments.report_from#
            and l.timestamp < #arguments.report_to#
            and l.logfile_id = lf.id
        order by l.timestamp asc
    ) PageReversed
    ORDER BY timestamp DESC
) PageVals
ORDER BY timestamp ASC
