Big Table Advice (SQL Server) - sql-server

I'm experiencing massive slowness when accessing one of my tables and I need some refactoring advice. Sorry if this is not the correct area for this sort of thing.
I'm working on a project that aims to report on server performance statistics for our internal servers. I'm processing Windows performance logs every night (12 servers, 10 performance counters, logging every 15 seconds). I'm storing the data in a table as follows:
CREATE TABLE [dbo].[log](
[id] [int] IDENTITY(1,1) NOT NULL,
[logfile_id] [int] NOT NULL,
[test_id] [int] NOT NULL,
[timestamp] [datetime] NOT NULL,
[value] [float] NOT NULL,
CONSTRAINT [PK_log] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH FILLFACTOR = 90 ON [PRIMARY]
) ON [PRIMARY]
There are currently 16,529,131 rows and it will keep on growing.
I access the data to produce reports and create graphs from ColdFusion like so:
SET NOCOUNT ON
CREATE TABLE ##RowNumber ( RowNumber int IDENTITY (1, 1), log_id char(9) )
INSERT ##RowNumber (log_id)
SELECT l.id
FROM log l, logfile lf
WHERE lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
order by l.timestamp asc
select rn.RowNumber, l.value, l.timestamp
from log l, logfile lf, ##RowNumber rn
where lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#
and l.logfile_id = lf.id
and rn.log_id = l.id
and ((rn.rownumber % #modu# = 0) or (rn.rownumber = 1))
order by l.timestamp asc
DROP TABLE ##RowNumber
SET NOCOUNT OFF
(for non-CF devs: #value# interpolates the value, and ## escapes to a literal #)
I basically create a temporary table so that I can use the row number to select every x rows. In this way I'm only selecting the number of rows I can display. This helps, but it's still very slow.
SQL Server Management Studio tells me my indexes are as follows (I have pretty much no knowledge about using indexes properly):
IX_logfile_id (Non-Unique, Non-Clustered)
IX_test_id (Non-Unique, Non-Clustered)
IX_timestamp (Non-Unique, Non-Clustered)
PK_log (Clustered)
I would be very grateful to anyone who could give some advice that could help me speed things up a bit. I don't mind re-organising things and I have complete control of the project (perhaps not over the server hardware though).
Cheers (sorry for the long post)

Your problem is that you chose a bad clustered key. Nobody is ever interested in retrieving one particular log value by ID. If your system is like anything else I've seen, then all queries are going to ask for:
all counters for all servers over a range of dates
specific counter values over all servers for a range of dates
all counters for one server over a range of dates
specific counter for specific server over a range of dates
Given the size of the table, all your non-clustered indexes are useless. They are all going to hit the index tipping point, guaranteed, so they might just as well not exist. I assume all your non-clustered indexes are defined as a simple index over the field in the name, with no included fields.
I'm going to pretend I actually know your requirements. You must forget common sense about storage and actually duplicate all your data in every non-clustered index. Here is my advice:
Drop the clustered index on [id]; it is as useless as it gets.
Organize the table with a clustered index on (logfile_id, test_id, timestamp).
Non-clustered index on (test_id, logfile_id, timestamp) include (value)
NC index on (logfile_id, timestamp) include (value)
NC index on (test_id, timestamp) include (value)
NC index on (timestamp) include (value)
Add maintenance tasks to reorganize all indexes periodically, as they are prone to fragmentation (a DDL sketch of this layout follows the list)
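For illustration only, here is roughly what that layout could look like as DDL (index names are invented, and it assumes SQL Server 2005 or later for INCLUDE columns; treat it as an example, not a script to run as-is):
-- Replace the clustered PK on [id] with a clustered index that matches the queries
-- (optionally re-add the PK on [id] as a non-clustered constraint if you still need uniqueness)
ALTER TABLE dbo.[log] DROP CONSTRAINT PK_log
CREATE CLUSTERED INDEX CIX_log ON dbo.[log] (logfile_id, test_id, [timestamp])
-- Covering non-clustered indexes for the other access paths
CREATE NONCLUSTERED INDEX IX_log_test_logfile_time ON dbo.[log] (test_id, logfile_id, [timestamp]) INCLUDE (value)
CREATE NONCLUSTERED INDEX IX_log_logfile_time ON dbo.[log] (logfile_id, [timestamp]) INCLUDE (value)
CREATE NONCLUSTERED INDEX IX_log_test_time ON dbo.[log] (test_id, [timestamp]) INCLUDE (value)
CREATE NONCLUSTERED INDEX IX_log_time ON dbo.[log] ([timestamp]) INCLUDE (value)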
The clustered index covers the query 'history of specific counter value at a specific machine'. The non clustered indexes cover various other possible queries (all counters at a machine over time, specific counter across all machines over time etc).
You notice I did not comment anything about your query script. That is because there isn't anything in the world you can do to make the queries run faster over the table structure you have.
Now one thing you shouldn't do is actually implement my advice. I said I'm going to pretend I know your requirements. But I actually don't. I just gave an example of a possible structure. What you really should do is study the topic and figure out the correct index structure for your requirements:
General Index Design Guidelines.
Index Design Basics
Index with Included Columns
Query Types and Indexes
Also a google on 'covering index' will bring up a lot of good articles.
And of course, at the end of the day storage is not free, so you'll have to balance the requirement to have a non-clustered index on every possible combination with the need to keep the size of the database in check. Luckily you have a very small and narrow table, so duplicating it over many non-clustered indexes is no big deal. Also, I wouldn't be concerned about insert performance: 120 counters every 15 seconds means 8-9 inserts per second, which is nothing.

A couple things come to mind.
Do you need to keep that much data? If not, consider creating an archive table for the older data you want to keep (but don't create it just to join it with the primary table every time you run a query).
I would avoid using a temp table with so much data. See this article on temp table performance and how to avoid using them.
http://www.sql-server-performance.com/articles/per/derived_temp_tables_p1.aspx
It looks like you are missing an index on the server_id field. I would consider creating a covering index using this field and others. Here is an article on that as well.
http://www.sql-server-performance.com/tips/covering_indexes_p1.aspx
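For example, a minimal sketch of such an index (this assumes server_id lives in the logfile table, as the question's join suggests; the exact key and any included columns should come from the real queries):
CREATE NONCLUSTERED INDEX IX_logfile_server_id ON dbo.logfile (server_id)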
Edit
With that many rows in the table over such a short time frame, I would also check the indexes for fragmentation which may be a cause for slowness. In SQL Server 2000 you can use the DBCC SHOWCONTIG command.
See this link for info http://technet.microsoft.com/en-us/library/cc966523.aspx
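As a hedged example (the object name comes from the question; the options are standard DBCC SHOWCONTIG switches):
-- Report fragmentation for the log table and all of its indexes
DBCC SHOWCONTIG ('log') WITH ALL_INDEXES, TABLERESULTS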

Once, when still working with SQL Server 2000, I needed to do some paging, and I came across a method of paging that really blew my mind. Have a look at this method.
DECLARE @Table TABLE(
TimeVal DATETIME
)
DECLARE @StartVal INT
DECLARE @EndVal INT
SELECT @StartVal = 51, @EndVal = 100
SELECT *
FROM (
SELECT TOP (@EndVal - @StartVal + 1)
*
FROM (
--select up to end number
SELECT TOP (@EndVal)
*
FROM @Table
ORDER BY TimeVal ASC
) PageReversed
ORDER BY TimeVal DESC
) PageVals
ORDER BY TimeVal ASC
As an example
SELECT *
FROM (
SELECT TOP (@EndVal - @StartVal + 1)
*
FROM (
SELECT TOP (@EndVal)
l.id,
l.timestamp
FROM log l, logfile lf
WHERE lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
order by l.timestamp asc
) PageReversed ORDER BY timestamp DESC
) PageVals
ORDER BY timestamp ASC


How to generate a single GUID for all rows in a batch insert within the query?

I am writing a quick-and-dirty application to load sales plan data into SQL Server (2008 FWIW, though I don't think the specific version matters).
The data set is the corporate sales plan: a few thousand rows of Units, Dollars and Price for each combination of customer, part number and month. This data is updated every few weeks, and it's important to track who changed it and what the changes were.
-- Metadata columns are suffixed with ' ##', to enable an automated
-- tool I wrote to handle repetitive tasks such as de-duplication of
-- records whose values didn't change in successive versions of the
-- forecast.
CREATE TABLE [SlsPlan].[PlanDetail]
(
[CustID] [char](15) NOT NULL,
[InvtID] [char](30) NOT NULL,
[FiscalYear] [int] NOT NULL,
[FiscalMonth] [int] NOT NULL,
[Version Number ##] [int] IDENTITY(1,1) NOT NULL,
[Units] [decimal](18, 6) NULL,
[Unit Price] [decimal](18, 6) NULL,
[Dollars] [decimal](18, 6) NULL,
[Batch GUID ##] [uniqueidentifier] NOT NULL,
[Record GUID ##] [uniqueidentifier] NOT NULL DEFAULT (NEWSEQUENTIALID()),
[Time Created ##] [datetime] NOT NULL,
[User ID ##] [varchar](64) NULL DEFAULT (ORIGINAL_LOGIN()),
CONSTRAINT [PlanByProduct_PK] PRIMARY KEY CLUSTERED
([CustID], [InvtID], [FiscalYear], [FiscalMonth], [Version Number ##])
)
To track changes, I'm using an IDENTITY column as part of the primary key, which allows multiple versions of the same record. To track who made the change, and also to enable backing out an entire bad update if someone does something completely stupid, I am inserting the Active Directory logon of the creator of that version of the record, a time stamp, and two GUIDs.
The "Batch GUID" column should be the same for all records in a batch; the "Record GUID" column is obviously unique to that particular record and is used for de-duplication only, not for any sort of query.
I would strongly prefer to generate the batch GUID inside a query rather than by writing a stored procedure that does the obvious:
DECLARE @BatchGUID UNIQUEIDENTIFIER = NEWID()
INSERT INTO MyTable
SELECT I.*, @BatchGUID
FROM InputTable I
I figured the easy way to do this is to construct a single-row result with the timestamp, the user ID and a call to NEWID() to create the batch GUID. Then, do a CROSS JOIN to append that single row to each of the rows being inserted. I tried doing this a couple different ways, and it appears that the query execution engine is essentially executing the GETDATE() once, because a single time stamp appears in all rows (even for a 5-million row test case). However, I get a different GUID for each row in the result set.
The below examples just focus on the query, and omit the insert logic around them.
WITH MySingleRow AS
(
Select NewID() as [Batch GUID ##],
ORIGINAL_LOGIN() as [User ID ##],
getdate() as [Time Created ##]
)
SELECT N.*, R1.*
FROM util.zzIntegers N
CROSS JOIN MySingleRow R1
WHERE N.Sequence < 10000000
In the above query, "util.zzIntegers" is just a table of integers from 0 to 10 million. The query takes about 10 seconds to run on my server with a cold cache, so if SQL Server were executing the GETDATE() function with each row of the main table, it would certainly have a different value at least in the milliseconds column, but all 10 million rows have the same timestamp. But I get a different GUID for each row. As I said before, the goal is to have the same GUID in each row.
I also decided to try a version with an explicit table value constructor in hopes that I would be able to fool the optimizer into doing the right thing. I also ran it against a real table rather than a relatively "synthetic" test like a single-column list of integers. The following produced the same result.
WITH AnotherSingleRow AS
(
SELECT SingleRow.*
FROM (
VALUES (NewID(), Original_Login(), getdate())
)
AS SingleRow(GUID, UserID, TimeStamp)
)
SELECT R1.*, S.*
FROM SalesOrderLineItems S
CROSS JOIN AnotherSingleRow R1
The SalesOrderLineItems is a table with 6 million rows and 135 columns, to make doubly sure that runtime was sufficiently long that the GETDATE() would increment if SQL Server were completely optimizing away the table value constructor and just calling the function each time the query runs.
I've been lurking here for a while, and this is my first question, so I definitely wanted to do good research and avoid criticism for just throwing a question out there. The following questions on this site deal with GUIDs but aren't directly relevant. I also spent half an hour searching Google with various combinations of phrases, which didn't seem to turn up anything.
Azure actually does what I want, as evidenced in the following question I
turned up in my research:
Guid.NewGuid() always return same Guid for all rows.
However, I'm not on Azure and not going to go there anytime soon.
Someone tried to do the same thing in SSIS
(How to insert the same guid in SSIS import)
but the answer there was to generate the GUID in SSIS as a
variable and insert it into each row. I could certainly do
the equivalent in a stored procedure but for the sake of elegance and
maintainability (my colleagues have less experience with SQL Server queries
than I do), I would prefer to keep the creation of the batch GUID in
a query, and to simplify any stored procedures as much as possible.
BTW, my experience level is 1-2 years with SQL Server as a data analyst/SQL developer as part of 10+ years spent writing code, but for the last 20 years I've been mostly a numbers guy rather than an IT guy. Early in my career, I worked for a pioneering database vendor as one of the developers of the query optimizer, so I have a pretty good idea what a query optimizer does, but haven't had time to really dig into how SQL Server does it. So I could be completely missing something that's obvious to others.
Thank you in advance for your help.

SQL Server Indexing Strategy

I'm building a web application using SQL Server 2008 and am having difficulty coming up with the best indexing strategy given our use-case. For example, most of the tables are structured similar to the following:
CREATE TABLE Jobs
(
Id int identity(0, 1) not null,
CmpyId int not null default (0),
StatusId int not null default (0),
Name nvarchar(100) null,
IsDeleted bit not null default (0),
CONSTRAINT [PK_dbo.Jobs]
PRIMARY KEY NONCLUSTERED (Id ASC))
CREATE CLUSTERED INDEX IX_Jobs_CmpyIdAndId
ON Jobs (CmpyId, Id)
CREATE INDEX IX_Jobs_CmpyIdAndStatusId
ON Jobs (CmpyId, StatusId)
In our application, users are separated into different companies which results in nearly all queries looking similar to the following:
SELECT *
FROM Jobs
WHERE CmpyId = #cmpyId AND ...
Additionally, jobs are frequently accessed by StatusId (canceled = -1, pending = 0, open = 1, assigned = 2, closed = 3), similar to the following:
SELECT *
FROM Jobs
WHERE CmpyId = #cmpyId
AND StatusId >= 0
AND StatusId < 3
Would I be better off using the composite clustered index as shown above, or should I use the default clustered index on the Id field only and create a separate index for CmpyId?
For the StatusId column, would I be correct in assuming a filtered index would be the way to go?
I'm also considering partitioning the table by CmpyId or StatusId, but not sure which would be best (or if no partition is best).
This is kinda premature optimization. You can spend a lot of time worrying about which one will net you a slightly faster database, but it's when you are live in production that you will have the best chance of optimizing your indexes.
SQL Server has traces to see which queries are run the most and take the longest. You can test out different indexing strategies when it's live in production with almost no risk. At worst you can slow your application down.
I typically set up clustered indexes on the primary key, and non-clustered indexes on all important columns. This works well for the JVM stack that we use with SQL Server. You don't know where the bottlenecks are going to be without having data to see them.
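Not from the answer above, but as one hedged illustration: once you are live, the plan-cache DMVs (SQL Server 2005 and later) give a similar picture to a trace of which statements run most often and cost the most.
-- Sketch: top cached statements by total elapsed time
SELECT TOP (20)
    qs.execution_count,
    qs.total_elapsed_time,
    qs.total_logical_reads,
    SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
        (CASE WHEN qs.statement_end_offset = -1
              THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset END
         - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC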

DB Engine Tuning Advisor suggestion improvement

We have a table which holds all email messages ready to send and which have already been sent. The table contains over 1 million rows.
Below is the query to find the messages which still need to be sent. After 5 errors the message is not attempted anymore and needs to be fixed manually. SentDate remains null until the message is sent.
SELECT TOP (15)
ID,
FromEmailAddress,
FromEmailDisplayName,
ReplyToEmailAddress,
ToEmailAddresses,
CCEmailAddresses,
BCCEmailAddresses,
[Subject],
Body,
AttachmentUrl
FROM sysEmailMessage
WHERE ErrorCount < 5
AND SentDate IS NULL
ORDER BY CreatedDate
The query is slow, I assumed due to lacking indexes. I've offered the query to the Database Engine Tuning Advisor. It suggests the below index (and some statistics, which I generally ignore):
SET ANSI_PADDING ON
CREATE NONCLUSTERED INDEX [_dta_index_sysEmailMessage_7_1703677117__K14_K1_K12_5_6_7_8_9_10_11_15_17_18] ON [dbo].[sysEmailMessage]
(
[SentDate] ASC,
[ID] ASC,
[ErrorCount] ASC
)
INCLUDE ( [FromEmailAddress],
[ToEmailAddresses],
[CCEmailAddresses],
[BCCEmailAddresses],
[Subject],
[Body],
[AttachmentUrl],
[CreatedDate],
[FromEmailDisplayName],
[ReplyToEmailAddress]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
(On a side note: this index has a suggested size of 5,850,573 KB (?), which is nearly 6 GB and doesn't make any sense to me at all.)
My question is does this suggested index make any sense? Why for example is the ID column included, while it's not needed in the query (as far as I can tell)?
As far as my knowledge of indexes goes they are meant to be a fast lookup to find the relevant row. If I had to design the index myself I would come up with something like:
SET ANSI_PADDING ON
CREATE NONCLUSTERED INDEX [index_alternative_a] ON [dbo].[sysEmailMessage]
(
[SentDate] ASC,
[ErrorCount] ASC
)
WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
Is the optimizer really clever or is my index more efficient and probably better?
There are two different aspects to selecting an index: the fields you need for finding the rows (= the actual indexed fields), and the fields that are needed after that (= included fields). If you're always fetching the top 15 rows, you can totally ignore included fields, because 15 key lookups will be fast -- and adding the whole email body to the index would make it huge.
For the indexed fields, it's quite important to know how big a percentage of the data matches your criteria.
Assuming almost all of your rows have ErrorCount < 5, you should not have it in the index -- but if matching it is a rare case, then it's good to have.
Assuming SentDate is really rarely NULL, then you should have that as the first column of the index.
Having CreatedDate in the index depends on how many rows, on average, are found from the table with the ErrorCount and SentDate criteria. If it is a lot (thousands), then it might help to have it there so the newest rows can be found fast.
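To make that concrete, here is a hedged sketch of the kind of narrow index this reasoning points at (index names invented; verify the assumptions about your data distribution first):
-- Assumes unsent rows (SentDate IS NULL) are rare and almost all rows have ErrorCount < 5,
-- so ErrorCount is left out and re-checked on the few rows the seek returns
CREATE NONCLUSTERED INDEX IX_sysEmailMessage_SentDate_CreatedDate
ON dbo.sysEmailMessage (SentDate, CreatedDate)
-- Alternative on SQL Server 2008 or later: a filtered index matching the WHERE clause is even smaller
CREATE NONCLUSTERED INDEX IX_sysEmailMessage_Unsent
ON dbo.sysEmailMessage (CreatedDate)
WHERE SentDate IS NULL AND ErrorCount < 5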
But like always, several things affect the performance so you should test how different options affect your environment.

Can I have a primary key and a separate clustered index together?

Let's assume I already have a primary key, which ensures uniqueness. My primary key is also the ordering index for the records. However, I am curious about the primary key's role in the physical ordering of records on disk (if it has one). And the actual question is: can I have a separate clustered index for these records?
This is an attempt at testing the size and performance characteristics of a covering secondary index on a clustered table, as per the discussion with @Catcall.
All tests were done on MS SQL Server 2008 R2 Express (inside a fairly underpowered VM).
Size
First, I created a clustered table with a secondary index and filled it with some test data:
CREATE TABLE THE_TABLE (
FIELD1 int,
FIELD2 int NOT NULL,
CONSTRAINT THE_TABLE_PK PRIMARY KEY (FIELD1)
);
CREATE INDEX THE_TABLE_IE1 ON THE_TABLE (FIELD2) INCLUDE (FIELD1);
DECLARE @COUNT int = 1;
WHILE @COUNT <= 1000000 BEGIN
INSERT INTO THE_TABLE (FIELD1, FIELD2) VALUES (@COUNT, @COUNT);
SET @COUNT = @COUNT + 1;
END;
EXEC sp_spaceused 'THE_TABLE';
The last line gave me the following result...
name rows reserved data index_size unused
THE_TABLE 1000000 27856 KB 16808 KB 11008 KB 40 KB
So, the index's B-Tree (11008 KB) is actually smaller than the table's B-Tree (16808 KB).
Speed
I generated a random number within the range of the data in the table, and then used it as criteria for selecting a whole row from the table. This was repeated 10000 times and the total time measured:
DECLARE @I int = 1;
DECLARE @F1 int;
DECLARE @F2 int;
DECLARE @END_TIME DATETIME2;
DECLARE @START_TIME DATETIME2 = SYSDATETIME();
WHILE @I <= 10000 BEGIN
SELECT @F1 = FIELD1, @F2 = FIELD2
FROM THE_TABLE
WHERE FIELD1 = (SELECT CEILING(RAND() * 1000000));
SET @I = @I + 1;
END;
SET @END_TIME = SYSDATETIME();
SELECT DATEDIFF(millisecond, @START_TIME, @END_TIME);
The last line produces an average time (of 10 measurements) of 181.3 ms.
When I change the query condition to: WHERE FIELD2 = ..., so the secondary index is used, the average time is 195.2 ms.
Execution plans:
So the performance (of selecting on the PK versus on the covering secondary index) seems to be similar. For much larger amounts of data, I suspect the secondary index could possibly be slightly faster (since it seems more compact and therefore cache-friendly), but I didn't hit that yet in my testing.
String Measurements
Using varchar(50) as type for FIELD1 and FIELD2 and inserting strings that vary in length between 22 and 28 characters gave similar results.
The sizes were:
name rows reserved data index_size unused
THE_TABLE 1000000 208144 KB 112424 KB 95632 KB 88 KB
And the average timings were: 254.7 ms for searching on FIELD1 and 296.9 ms for FIELD2.
Conclusion
If a clustered table has a covering secondary index, that index will have space and time characteristics similar to the table itself (possibly slightly slower, but not by much). In effect, you'll have two B-Trees that sort their data differently but are otherwise very similar, achieving your goal of having a "second cluster".
It depends on your dbms. Not all of them implement clustered indexes. Those that do are liable to implement them in different ways. As far as I know, every platform that implements clustered indexes also provides ways to choose which columns are in the clustered index, although often the primary key is the default.
In SQL Server, you can create a nonclustered primary key and a separate clustered index like this.
create table test (
test_id integer primary key nonclustered,
another_column char(5) not null unique clustered
);
I think that the closest thing to this in Oracle is an index organized table. I could be wrong. It's not quite the same as creating a table with a clustered index in SQL Server.
You can't have multiple clustered indexes on a single table in SQL Server. A table's rows can only be stored in one order at a time. Actually, I suppose you could store rows in multiple, distinct orders, but you'd have to essentially duplicate all or part of the table for each order. (Although I didn't know it at the time I wrote this answer, DB2 UDB supports multiple clustered indexes, and it's quite an old feature. Its design and implementation is quite different from SQL Server.)
A primary key's job is to guarantee uniqueness. Although that job is often done by creating a unique index on the primary key column(s), strictly speaking uniqueness and indexing are two different things with two different aims. Uniqueness aims for data integrity; indexing aims for speed.
A primary key declaration isn't intended to give you any information about the order of rows on disk. In practice, it usually gives you some information about the order of index entries on disk. (Because primary keys are usually implemented using a unique index.)
If you SELECT rows from a table that has a clustered index, you still can't be assured that the rows will be returned to the user in the same order that they're stored on disk. Loosely speaking, the clustered index helps the query optimizer find rows faster, but it doesn't control the order in which those rows are returned to the user. The only way to guarantee the order in which rows are returned to the user is with an explicit ORDER BY clause. (This seems to be a fairly frequent point of confusion. A lot of people seem surprised when a bare SELECT on a clustered index doesn't return rows in the order they expect.)

Querying minimum value in SQL Server is a lot longer than querying all the rows

I'm currently confronted with a strange behaviour in my database when querying the minimum ID for a specific date in a table containing about a hundred million rows. The query is quite simple:
SELECT MIN(Id) FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
This query never ends; at least, I let it run for hours. The DateConnection column is not indexed, nor included in any index, so I would understand this query taking quite a while. But I tried the following query, which runs in a few seconds:
SELECT Id FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
It returns 300k rows.
My table is defined as this :
CREATE TABLE [dbo].[Connection](
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[DateConnection] [datetime] NOT NULL,
[TimeConnection] [time](7) NOT NULL,
[Hour] AS (datepart(hour,[TimeConnection])) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection] PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
)
And it has the following index :
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id] ON [dbo].[Connection]
(
[Id] ASC
)ON [PRIMARY]
One solution I found using this strange behaviour is the following code, but it seems quite heavy for such a simple query.
create table #TempId
(
[Id] bigint
)
go
insert into #TempId
select id from partitionned_connection with(nolock) where dateconnection = '2012-06-26'
declare @displayId bigint
select @displayId = min(Id) from #TempId
print @displayId
go
drop table #TempId
go
Has anybody been confronted with this behaviour, and what is the cause of it? Is the minimum aggregate scanning the entire table? And if so, why doesn't the simple select?
The root cause of the problem is the non-aligned nonclustered index, combined with the statistical limitation Martin Smith points out (see his answer to another question for details).
Your table is partitioned on [Hour] along these lines:
CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23);
CREATE PARTITION SCHEME PS
AS PARTITION PF ALL TO ([PRIMARY]);
-- Partitioned
CREATE TABLE dbo.Connection
(
Id bigint IDENTITY(1,1) NOT NULL,
DateConnection datetime NOT NULL,
TimeConnection time(7) NOT NULL,
[Hour] AS (DATEPART(HOUR, TimeConnection)) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection]
PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
ON PS ([Hour])
);
-- Not partitioned
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id]
ON dbo.Connection
(
Id ASC
)ON [PRIMARY];
-- Pretend there are lots of rows
UPDATE STATISTICS dbo.Connection WITH ROWCOUNT = 200000000, PAGECOUNT = 4000000;
The query and execution plan are:
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH (READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26';
The optimizer takes advantage of the index (ordered on Id) to transform the MIN aggregate to a TOP (1) - since the minimum value will by definition be the first value encountered in the ordered stream. (If the nonclustered index were also partitioned, the optimizer would not choose this strategy since the required ordering would be lost).
The slight complication is that we also need to apply the predicate in the WHERE clause, which requires a lookup to the base table to fetch the DateConnection value. The statistical limitation Martin mentions explains why the optimizer estimates it will only need to check 119 rows from the ordered index before finding one with a DateConnection value that will match the WHERE clause. The hidden correlation between DateConnection and Id values means this estimate is a very long way off.
In case you are interested, the Compute Scalar calculates which partition to perform the Key Lookup into. For each row from the nonclustered index, it computes an expression like [PtnId1000] = Scalar Operator(RangePartitionNew([dbo].[Connection].[Hour] as [c].[Hour],(1),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23))), and this is used as the leading key of the lookup seek. There is prefetching (read-ahead) on the nested loops join, but this needs to be an ordered prefetch to preserve the sorting required by the TOP (1) optimization.
Solution
We can avoid the statistical limitation (without using query hints) by finding the minimum Id for each Hour value, and then taking the minimum of the per-hour minimums:
-- Global minimum
SELECT
MinID = MIN(PerHour.MinId)
FROM
(
-- Local minimums (for each distinct hour value)
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH(READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26'
GROUP BY
c.[Hour]
) AS PerHour;
The execution plan is:
If parallelism is enabled, you will see a plan more like the following, which uses parallel index scan and multi-threaded stream aggregates to produce the result even faster:
Although it might be wise to fix the problem in a way that doesn't require index hints, a quick solution is this:
SELECT MIN(Id) FROM Connection WITH(NOLOCK, INDEX(PK_Connection)) WHERE DateConnection = '2012-06-26'
This forces a table scan.
Alternatively, try this although it probably produces the same problem:
select top 1 Id
from Connection
WHERE DateConnection = '2012-06-26'
order by Id
It makes sense that finding the minimum takes longer than going through all the records. Finding the minimum of an unsorted structure takes much longer than traversing it once (unsorted because MIN() doesn't take advantage of the identity column). What you could do, since you're using an identity column, is have a nested select, where you take the first record from the set of records with the specified date.
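A minimal sketch of one reading of that suggestion (it mirrors the temp-table workaround from the question, minus the temp table; whether it dodges the slow plan depends on the optimizer):
SELECT MIN(Id)
FROM (
    SELECT Id
    FROM Connection WITH (NOLOCK)
    WHERE DateConnection = '2012-06-26'
) AS FilteredIds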
The NC index scan is the issue in your case. The plan scans the unique non-clustered index and then, for each of the hundred million rows, traverses the clustered index, which causes millions of IOs (say your index height is 4; that could be 100 million * 4 IOs, plus the scan of the non-clustered index leaf pages). The optimizer must have chosen this index to avoid the stream aggregate needed to get the minimum. There are three main techniques for finding a minimum: the first uses an index on the column you want the MIN of (efficient when such an index exists, because no calculation is required; the first row found is the answer); the second is a hash aggregate (usually seen with GROUP BY); and the third is a stream aggregate, which scans all qualifying rows, keeps the running minimum, and returns it once all rows are scanned.
However, the query without MIN used a clustered index scan, and is fast because it has to read far fewer pages and therefore does far fewer IOs.
Now the question is why the optimizer picked the scan of the non-clustered index. I am sure it is to avoid the computation involved in a stream aggregate, but in this case not using the stream aggregate is much more costly. This depends on estimation, so I guess the statistics on the table are not up to date.
So first of all, check whether your stats are up to date. When were the statistics last updated?
To avoid the issue, do the following (a sketch follows the list):
1. First update the table statistics; I am sure that will remove your issue.
2. If you cannot update stats, or updating stats doesn't change the plan and it still uses the NC index scan, then force the clustered index scan so that it uses fewer IOs, followed by a stream aggregate to get the MIN value.
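A hedged sketch of both steps (the table name comes from the question; FULLSCAN is just one sampling option):
-- 1. Refresh statistics on the table
UPDATE STATISTICS dbo.Connection WITH FULLSCAN
-- 2. If the plan still scans the nonclustered index, force the clustered index,
--    as in the earlier answer
SELECT MIN(Id)
FROM Connection WITH (NOLOCK, INDEX(PK_Connection))
WHERE DateConnection = '2012-06-26'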
