How to avoid table scan on SQL Server in this situation - sql-server

There are two tables, Costs and Logs. The Costs table can hold millions of rows, and the Logs table billions of rows.
I need to update the CostBy column in the Costs table from a scheduled service task in a production environment, processing 100 records per run.
CREATE TABLE Costs
(
C_PK uniqueidentifier primary key not null,
CostBy varchar(3) not null
)
CREATE TABLE Logs
(
L_PK uniqueidentifier primary key not null,
L_ParentTable varchar(255) not null, -- name of the parent table (Costs or others)
L_ParentID uniqueidentifier not null, -- the parent table's PK (e.g. Costs.C_PK)
L_Event varchar(3) not null, -- 'ADD' and other event types
L_User varchar(3) not null
)
CREATE NONCLUSTERED INDEX [L_ParentID]
ON [dbo].[Logs] ([L_ParentID] ASC)
Here is the original update statement:
UPDATE TOP(100) Costs
SET CostBy = ISNULL(L_User, '~UK')
FROM Costs
LEFT JOIN Logs ON L_ParentID = C_PK AND L_Event = 'ADD'
WHERE CostBy = ''
However, this statement causes a massive performance problem: a high-cost table scan on the Costs table.
My question is: how can I avoid the table scan on the Costs table, or otherwise optimize the update statement?
Thanks in advance.

You may want to try the following.
First, create an index on Logs, including all the relevant columns:
CREATE INDEX ix ON Logs
(
L_ParentID -- join condition, variable
)
INCLUDE
(
L_User -- no filter condition, but you use it in your update
)
WHERE
(
L_Event = 'ADD' -- join condition, constant
)
If only a single row will ever exist with an 'ADD' event for a given parent ID, make this a unique index, as that can dramatically improve performance.
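For example, assuming that guarantee holds, the unique variant would look like this (shown under a separate, illustrative name):
CREATE UNIQUE INDEX ix_add_unique ON Logs
(
L_ParentID
)
INCLUDE
(
L_User
)
WHERE
(
L_Event = 'ADD'
)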
Second, and this is a hit-and-miss situation, you may try an index on Costs (CostBy), because you're only looking for empty CostBy values to update. This index will itself have to be maintained by your UPDATE (since the statement changes CostBy), so it may slow your query down instead of speeding it up. It depends on a number of factors.
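A minimal sketch of that optional index (the name is illustrative; remember the UPDATE itself has to maintain it):
CREATE NONCLUSTERED INDEX ix_Costs_CostBy
ON Costs (CostBy)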
If you have an Enterprise edition license, try both indexes with WITH (DATA_COMPRESSION = PAGE); it can significantly improve IO time at the expense of CPU. It depends on which of the two is your bottleneck.
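For example, the filtered Logs index above could be rebuilt with page compression; note that data compression historically required Enterprise edition, although from SQL Server 2016 SP1 onwards it is available in all editions:
ALTER INDEX ix ON Logs
REBUILD WITH (DATA_COMPRESSION = PAGE)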
Additionally, depending on the nature of your data, updating statistics may improve your queries. If the number of rows with CostBy = '' is disproportionate compared to the other values, you may benefit from full-scan statistics on that column. Consider NORECOMPUTE if you only need them for this specific query, this one time.
CREATE STATISTICS st_Costs_CostBy
ON Costs (CostBy)
WITH FULLSCAN, NORECOMPUTE;

Related

Index and primary key in large table that doesn't have an Id column

I'm looking for guidance on the best practice for adding indexes / primary key for the following table in SQL Server.
My goal is to maximize performance mostly on selecting data, but also in inserts.
CREATE TABLE [IndicatorValue]
(
[IndicatorId] [uniqueidentifier] NOT NULL, -- this is a foreign key
[UnixTime] [bigint] NOT null,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL
)
The table will have over 10 million rows. Data is batch inserted between 5-10 thousand rows at a time.
I frequently query the data and retrieve the same 5-10 thousand rows at a time with SQL similar to
SELECT [UnixTime]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
or
SELECT [UnixTime], [Value]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
Based on my limited knowledge of SQL indexes, I think:
I should have a clustered index on IndicatorId and Interval. Because of the ORDER BY, should it also include UnixTime?
As I don't have an identity column (didn't create one because I wouldn't use it), I could have a non-clustered primary key on IndicatorId, UnixTime and Interval, because I read that it's always good to have PK on every table.
Also, the data is very rarely deleted, and there are not many updates, but when they happen it's only on 1 row.
Any insight on best practices would be much appreciated.
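As a concrete sketch of the layout described above (not a recommendation, just the DDL for the proposed non-clustered primary key and clustered index; names are illustrative):
CREATE TABLE dbo.IndicatorValue
(
[IndicatorId] [uniqueidentifier] NOT NULL,
[UnixTime] [bigint] NOT NULL,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL,
CONSTRAINT PK_IndicatorValue PRIMARY KEY NONCLUSTERED ([IndicatorId], [UnixTime], [Interval])
)
CREATE CLUSTERED INDEX CIX_IndicatorValue
ON dbo.IndicatorValue ([IndicatorId], [Interval], [UnixTime])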

Batching / Splitting a PostgreSQL database

I am working on a project which processes data in batches and fills up a PostgreSQL (9.6, but I could upgrade) database. The way it currently works is that the process happens in separate steps, and each step adds data to a table that it owns (rarely do two processes write to the same table; when they do, they write to different columns).
The way the data happens to be, it tends to become more and more fine-grained with each step. As a simplified example, I have one table defining the data sources. There are very few of them (tens to low hundreds), but each data source generates batches of data samples (batches and samples are separate tables, to store metadata). Each batch typically generates about 50k samples. Each of these data points then gets processed step by step, and each data sample generates more data points in the next table.
This worked fine until we got to 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now filtering for a batch starts to become slow (about 10 ms per sample retrieved), and it is becoming a major bottleneck, because getting the data for a batch takes 5-10 minutes (the fetching itself is milliseconds).
We have b-tree indices on all foreign keys that are involved for these queries.
Since our computations target the batches, I normally do not need to query across batches during the computation (which is when the query time hurts a lot at the moment). However, for data-analysis reasons, ad-hoc queries across batches need to remain possible.
So a very simple solution would be to generate an individual database for each batch, and somehow query across these databases when I need to. If I had only one batch in each database, obviously the filtering for a single batch would be instant and my problem would be solved (for now). However, then I would end up with thousands of databases and the data-analysis would be painful.
Within PostgreSQL, is there a way of pretending that I have separate databases for some queries? Ideally I would like to do that for each batch when I "register" a new batch.
Outside of the world of PostgreSQL, is there another database I should try for my use case?
Edit: DDL / Schema
In our current implementation, sample_representation is the table that all processing results depend on. A batch is truly defined by a tuple of (batch.id, representation.id). The query I tried, and described above as slow, is the following (10 ms per sample, adding up to around 5 minutes for 50k samples):
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
We currently have somewhere around 1.5 million samples, 2 representations and 460 batches (of which 49 have been processed; the others do not have samples associated with them), which means each batch has 30k samples on average. Some have around 50k.
The schema is below. There is some metadata associated with all tables, but I am not querying for it in this case. The actual sample data are stored separately on disk and not in the database, in case that makes a difference.
create table batch
(
id uuid default uuid_generate_v1mc() not null
constraint batch_pk
primary key,
path text not null
constraint unique_batch_path
unique,
id_data_source uuid
)
;
create table sample
(
id uuid default uuid_generate_v1mc() not null
constraint sample_pk
primary key,
sample_pos integer,
id_batch uuid
constraint batch_fk
references batch
on update cascade on delete set null
)
;
create index sample_sample_pos_index
on sample (sample_pos)
;
create index sample_id_batch_sample_pos_index
on sample (id_batch, sample_pos)
;
create table representation
(
id uuid default uuid_generate_v1mc() not null
constraint representation_pk
primary key,
id_data_source uuid
)
;
create table data_source
(
id uuid default uuid_generate_v1mc() not null
constraint data_source_pk
primary key
)
;
alter table batch
add constraint data_source_fk
foreign key (id_data_source) references data_source
on update cascade on delete set null
;
alter table representation
add constraint data_source_fk
foreign key (id_data_source) references data_source
on update cascade on delete set null
;
create table sample_representation
(
id uuid default uuid_generate_v1mc() not null
constraint sample_representation_pk
primary key,
id_sample uuid
constraint sample_fk
references sample
on update cascade on delete set null,
id_representation uuid
constraint representation_fk
references representation
on update cascade on delete set null
)
;
create unique index sample_representation_id_sample_id_representation_uindex
on sample_representation (id_sample, id_representation)
;
create index sample_representation_id_sample_index
on sample_representation (id_sample)
;
create index sample_representation_id_representation_index
on sample_representation (id_representation)
;
After fiddling around, I found a solution. But I am still not sure why the original query really takes that much time:
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
Everything is indexed, but the tables are relatively big, with 1.5 million rows each in sample_representation and sample. I guess what happens is that the tables get joined first and then filtered with WHERE. But even if the join produces a large intermediate result, it should not take that long, should it?
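If it helps to see what the planner actually does, EXPLAIN with the ANALYZE and BUFFERS options shows the chosen plan, the row estimates and the actual timings (the UUID literals are placeholders, as in the query above):
EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid';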
In any case, I tried to use a CTE instead of joining two "massive" tables. The idea was to filter early and then join afterwards:
WITH sel_samplerepresentation AS (
SELECT *
FROM sample_representation
WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS (
SELECT *
FROM sample
WHERE id_batch='75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample
This query also takes forever, and here the reason is clear: sel_samples and sel_samplerepresentation have 50k records each, and the join happens on non-indexed columns of the CTEs.
Since there are no indexes on CTEs, I reformulated them as materialized views, to which I can add indexes:
CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
SELECT *
FROM sample_representation
WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
);
CREATE MATERIALIZED VIEW sel_samples AS (
SELECT *
FROM sample
WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
);
CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_index ON sel_samples (id);
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;
DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;
This is more of a hack than a solution, but executing these queries takes 1s! (down from 8min)

Querying minimum value in SQL Server is a lot longer than querying all the rows

I'm currently confronted with strange behaviour in my database when querying the minimum ID for a specific date in a table containing about a hundred million rows. The query is quite simple:
SELECT MIN(Id) FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
This query never ends; at least, I let it run for hours. The DateConnection column is not indexed, nor is it included in any index, so I would understand if this query took quite a while. But I tried the following query, which runs in a few seconds:
SELECT Id FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
It returns 300k rows.
My table is defined as this :
CREATE TABLE [dbo].[Connection](
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[DateConnection] [datetime] NOT NULL,
[TimeConnection] [time](7) NOT NULL,
[Hour] AS (datepart(hour,[TimeConnection])) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection] PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
)
And it has the following index :
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id] ON [dbo].[Connection]
(
[Id] ASC
)ON [PRIMARY]
One solution I found, exploiting this strange behaviour, is the following code. But it seems quite heavy for such a simple query.
create table #TempId
(
[Id] bigint
)
go
insert into #TempId
select id from Connection with(nolock) where dateconnection = '2012-06-26'
declare @displayId bigint
select @displayId = min(Id) from #TempId
print @displayId
go
drop table #TempId
go
Has anybody encountered this behaviour, and what causes it? Is the MIN aggregate scanning the entire table? And if so, why doesn't the simple SELECT?
The root cause of the problem is the non-aligned nonclustered index, combined with the statistical limitation Martin Smith points out (see his answer to another question for details).
Your table is partitioned on [Hour] along these lines:
CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23);
CREATE PARTITION SCHEME PS
AS PARTITION PF ALL TO ([PRIMARY]);
-- Partitioned
CREATE TABLE dbo.Connection
(
Id bigint IDENTITY(1,1) NOT NULL,
DateConnection datetime NOT NULL,
TimeConnection time(7) NOT NULL,
[Hour] AS (DATEPART(HOUR, TimeConnection)) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection]
PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
ON PS ([Hour])
);
-- Not partitioned
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id]
ON dbo.Connection
(
Id ASC
)ON [PRIMARY];
-- Pretend there are lots of rows
UPDATE STATISTICS dbo.Connection WITH ROWCOUNT = 200000000, PAGECOUNT = 4000000;
The query and execution plan are:
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH (READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26';
The optimizer takes advantage of the index (ordered on Id) to transform the MIN aggregate to a TOP (1) - since the minimum value will by definition be the first value encountered in the ordered stream. (If the nonclustered index were also partitioned, the optimizer would not choose this strategy since the required ordering would be lost).
The slight complication is that we also need to apply the predicate in the WHERE clause, which requires a lookup to the base table to fetch the DateConnection value. The statistical limitation Martin mentions explains why the optimizer estimates it will only need to check 119 rows from the ordered index before finding one with a DateConnection value that will match the WHERE clause. The hidden correlation between DateConnection and Id values means this estimate is a very long way off.
In case you are interested, the Compute Scalar calculates which partition to perform the Key Lookup into. For each row from the nonclustered index, it computes an expression like [PtnId1000] = Scalar Operator(RangePartitionNew([dbo].[Connection].[Hour] as [c].[Hour],(1),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23))), and this is used as the leading key of the lookup seek. There is prefetching (read-ahead) on the nested loops join, but this needs to be an ordered prefetch to preserve the sorting required by the TOP (1) optimization.
Solution
We can avoid the statistical limitation (without using query hints) by finding the minimum Id for each Hour value, and then taking the minimum of the per-hour minimums:
-- Global minimum
SELECT
MinID = MIN(PerHour.MinId)
FROM
(
-- Local minimums (for each distinct hour value)
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH(READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26'
GROUP BY
c.[Hour]
) AS PerHour;
(The execution plan image is not reproduced here.)
If parallelism is enabled, you will see a plan that uses a parallel index scan and multi-threaded stream aggregates to produce the result even faster.
Although it might be wise to fix the problem in a way that doesn't require index hints, a quick solution is this:
SELECT MIN(Id) FROM Connection WITH(NOLOCK, INDEX(PK_Connection)) WHERE DateConnection = '2012-06-26'
This forces a scan of the clustered index (effectively a table scan).
Alternatively, try this although it probably produces the same problem:
select top 1 Id
from Connection
WHERE DateConnection = '2012-06-26'
order by Id
It makes sense that finding the minimum takes longer than going through all the records. Finding the minimum of an unsorted structure takes much longer than traversing it once (unsorted because MIN() doesn't take advantage of the identity column). What you could do, since you're using an identity column, is have a nested select, where you take the first record from the set of records with the specified date.
The nonclustered index scan is the issue in your case. The plan scans the unique nonclustered index and then, for each row (up to a hundred million rows), traverses the clustered index, which causes millions of IOs (say your index height is 4; that could mean 100 million * 4 IOs plus the scan of the nonclustered index leaf pages). The optimizer presumably chose this index to avoid a stream aggregate for computing the minimum. There are three main techniques for finding a minimum: using an index on the column you want the minimum of (efficient when such an index exists, since the first row found is the answer), a hash aggregate (usually seen with GROUP BY), and a stream aggregate (scan all qualifying rows, keep track of the minimum, and return it once all rows are scanned).
The query without MIN, however, used a clustered index scan, which is fast because it reads fewer pages and therefore does fewer IOs.
Now the question is why the optimizer picked the scan on the nonclustered index. I am sure it did so to avoid the computation involved in a stream aggregate, but in this case not using the stream aggregate is far more costly. This depends on estimates, so I suspect the statistics on the table are not up to date.
So first of all, check whether your statistics are up to date. When were they last updated?
To avoid the issue, do the following:
1. Update the table statistics; I am confident this will resolve your issue.
2. If you cannot update statistics, or updating them does not change the plan and it still uses the nonclustered index scan, force the clustered index scan so that the query does fewer IOs, followed by a stream aggregate to get the minimum value.
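For reference, a full statistics update along the lines of point 1 might look like this (FULLSCAN is the thorough but slower option):
UPDATE STATISTICS dbo.Connection WITH FULLSCAN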

Creating a Primary Key on a temp table - When?

I have a stored procedure that is working with a large amount of data. I have that data being inserted in to a temp table. The overall flow of events is something like
CREATE TABLE #TempTable (
Col1 NUMERIC(18,0) NOT NULL, -- This will not be an identity column.
Col2 INT NOT NULL,
Col3 BIGINT,
Col4 VARCHAR(25) NOT NULL
--Etc...
--
--Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because the index needs to be reorganized as the primary key info is being created. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types shown are the actual ones. In the #TempTable table, Col1 and Col4 will be making up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, this aside, my question still stands: which is faster, assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation. Presumably there is a best index for reading the data from it for the next step. This index will need to either be in place or created. This is where you have to make a tradeoff of speed here for reliability (apply the PK and any other constraints first) and speed later (have at least the clustered index in place if you are going to have one).
If the recovery model of your database is set to SIMPLE or BULK_LOGGED, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT ... INTO is a bulk operation, and bulk operations are minimally logged.
e.g.:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED key word is optional, primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
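Here is a rough, self-contained way to measure it yourself; the table shape and the row source below are made up purely for illustration:
SET STATISTICS TIME ON
SET STATISTICS IO ON
-- Variant A: primary key declared before the load
CREATE TABLE #A (Col1 INT NOT NULL, Col4 VARCHAR(25) NOT NULL, PRIMARY KEY (Col1, Col4))
INSERT INTO #A (Col1, Col4)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'x'
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b
-- Variant B: load into a heap first, then add the primary key
CREATE TABLE #B (Col1 INT NOT NULL, Col4 VARCHAR(25) NOT NULL)
INSERT INTO #B (Col1, Col4)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'x'
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b
ALTER TABLE #B ADD PRIMARY KEY (Col1, Col4)
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
DROP TABLE #A
DROP TABLE #B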
I was wondering if I could improve a very expensive stored procedure entailing a bunch of checks at each insert across tables, and came across this answer. In the sproc, several temp tables are opened and reference each other. I added the primary key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS to insert data and ensure uniqueness) and my execution time was cut down severely. I highly recommend using the primary keys. Always at least try it out, even when you think you don't need it.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK at table creation, the insert checks total T(n) comparisons (the n-th triangular number, 1 + 2 + 3 + ... + n), because when you insert the x-th row it is checked against the x - 1 previously inserted rows.
When you add the PK after inserting all the values, the check is n * n comparisons, because each of the n rows is checked against all n existing rows.
The first is obviously faster, since T(n) = n(n + 1)/2 is roughly half of n^2.
P.S. Example: if you insert 5 rows, that is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations.

Big Table Advice (SQL Server)

I'm experiencing massive slowness when accessing one of my tables and I need some refactoring advice. Sorry if this is not the correct area for this sort of thing.
I'm working on a project that aims to report on server performance statistics for our internal servers. I'm processing Windows performance logs every night (12 servers, 10 performance counters, logging every 15 seconds). I'm storing the data in a table as follows:
CREATE TABLE [dbo].[log](
[id] [int] IDENTITY(1,1) NOT NULL,
[logfile_id] [int] NOT NULL,
[test_id] [int] NOT NULL,
[timestamp] [datetime] NOT NULL,
[value] [float] NOT NULL,
CONSTRAINT [PK_log] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH FILLFACTOR = 90 ON [PRIMARY]
) ON [PRIMARY]
There are currently 16,529,131 rows, and the table will keep on growing.
I access the data to produce reports and create graphs from ColdFusion like so:
SET NOCOUNT ON
CREATE TABLE ##RowNumber ( RowNumber int IDENTITY (1, 1), log_id char(9) )
INSERT ##RowNumber (log_id)
SELECT l.id
FROM log l, logfile lf
WHERE lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#"
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
order by l.timestamp asc
select rn.RowNumber, l.value, l.timestamp
from log l, logfile lf, ##RowNumber rn
where lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#
and l.logfile_id = lf.id
and rn.log_id = l.id
and ((rn.rownumber % #modu# = 0) or (rn.rownumber = 1))
order by l.timestamp asc
DROP TABLE ##RowNumber
SET NOCOUNT OFF
(for non-CF devs: #value# inserts a value, and ## maps to #)
I basically create a temporary table so that I can use the row number to select every x-th row; this way I'm only selecting the number of rows I can display. This helps, but it's still very slow.
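As an aside, and only if you are on SQL Server 2005 or later (the sketch below uses ROW_NUMBER(), which 2000 does not have), the "every x-th row" sampling can be done without the temp table; the #arguments.*# markers are the same ColdFusion substitutions as above:
select t.RowNumber, t.value, t.timestamp
from (
select ROW_NUMBER() over (order by l.timestamp asc) as RowNumber, l.value, l.timestamp
from log l, logfile lf
where lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
) t
where (t.RowNumber % #modu# = 0) or (t.RowNumber = 1)
order by t.timestamp asc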
SQL Server Management Studio tells me my indexes are as follows (I have pretty much no knowledge about using indexes properly):
IX_logfile_id (Non-Unique, Non-Clustered)
IX_test_id (Non-Unique, Non-Clustered)
IX_timestamp (Non-Unique, Non-Clustered)
PK_log (Clustered)
I would be very grateful to anyone who could give some advice that could help me speed things up a bit. I don't mind re-organising things and I have complete control of the project (perhaps not over the server hardware though).
Cheers (sorry for the long post)
Your problem is that you chose a bad clustered key. Nobody is ever interested in retrieving one particular log value by ID. If your system is like anything else I've seen, then all queries are going to ask for:
all counters for all servers over a range of dates
specific counter values over all servers for a range of dates
all counters for one server over a range of dates
specific counter for specific server over a range of dates
Given the size of the table, all your non-clustered indexes are useless. They are all going to hit the index tipping point, guaranteed, so they might just as well not exist. I assume all your non-clustered indexes are defined as a simple index over the field in the name, with no included fields.
I'm going to pretend I actually know your requirements. You must forget common sense about storage and actually duplicate all your data in every non-clustered index. Here is my advice:
Drop the clustered index on [id]; it is as useless as it gets.
Organize the table with a clustered index on (logfile_id, test_id, timestamp).
Non-clustered index on (test_id, logfile_id, timestamp) include (value)
NC index on (logfile_id, timestamp) include (value)
NC index on (test_id, timestamp) include (value)
NC index on (timestamp) include (value)
Add maintenance tasks to reorganize all indexes periodically, as they are prone to fragmentation (see the DDL sketch after this list).
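A DDL sketch of the above (index names are illustrative, INCLUDE requires SQL Server 2005 or later, and the existing clustered primary key has to be dropped first):
-- drop the existing clustered primary key
ALTER TABLE [dbo].[log] DROP CONSTRAINT [PK_log]
-- (optionally re-add a non-clustered primary key on [id] if you still need one)
-- new clustered index
CREATE CLUSTERED INDEX [CIX_log] ON [dbo].[log] ([logfile_id], [test_id], [timestamp])
-- covering non-clustered indexes
CREATE NONCLUSTERED INDEX [IX_log_test_logfile] ON [dbo].[log] ([test_id], [logfile_id], [timestamp]) INCLUDE ([value])
CREATE NONCLUSTERED INDEX [IX_log_logfile_ts] ON [dbo].[log] ([logfile_id], [timestamp]) INCLUDE ([value])
CREATE NONCLUSTERED INDEX [IX_log_test_ts] ON [dbo].[log] ([test_id], [timestamp]) INCLUDE ([value])
CREATE NONCLUSTERED INDEX [IX_log_ts] ON [dbo].[log] ([timestamp]) INCLUDE ([value])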
The clustered index covers the query 'history of specific counter value at a specific machine'. The non clustered indexes cover various other possible queries (all counters at a machine over time, specific counter across all machines over time etc).
You notice I did not comment anything about your query script. That is because there isn't anything in the world you can do to make the queries run faster over the table structure you have.
Now one thing you shouldn't do is actually implement my advice. I said I'm going to pretend I know your requirements. But I actually don't. I just gave an example of a possible structure. What you really should do is study the topic and figure out the correct index structure for your requirements:
General Index Design Guidelines.
Index Design Basics
Index with Included Columns
Query Types and Indexes
Also a google on 'covering index' will bring up a lot of good articles.
And of course, at the end of the day, storage is not free, so you'll have to balance the requirement to have a non-clustered index on every possible combination against the need to keep the size of the database in check. Luckily you have a very small and narrow table, so duplicating it over many non-clustered indexes is no big deal. Also, I wouldn't be concerned about insert performance; 120 counters at 15 seconds each means 8-9 inserts per second, which is nothing.
A couple things come to mind.
Do you need to keep that much data? If not, consider creating an archive table if you want to keep it (but don't create it just to join it with the primary table every time you run a query).
I would avoid using a temp table with so much data. See this article on temp table performance and how to avoid using them.
http://www.sql-server-performance.com/articles/per/derived_temp_tables_p1.aspx
It looks like you are missing an index on the server_id field. I would consider creating a covered index using this field and others. Here is an article on that as well.
http://www.sql-server-performance.com/tips/covering_indexes_p1.aspx
Edit
With that many rows in the table over such a short time frame, I would also check the indexes for fragmentation, which may be a cause of slowness. In SQL Server 2000 you can use the DBCC SHOWCONTIG command.
See this link for info http://technet.microsoft.com/en-us/library/cc966523.aspx
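For example (illustrative, SQL Server 2000 syntax):
DBCC SHOWCONTIG ('log') WITH TABLERESULTS, ALL_INDEXES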
Once, when still working with SQL Server 2000, I needed to do some paging, and I came across a method of paging that really blew my mind. Have a look at this method.
DECLARE #Table TABLE(
TimeVal DATETIME
)
DECLARE #StartVal INT
DECLARE #EndVal INT
SELECT #StartVal = 51, #EndVal = 100
SELECT *
FROM (
SELECT TOP (#EndVal - #StartVal + 1)
*
FROM (
--select up to end number
SELECT TOP (#EndVal)
*
FROM #Table
ORDER BY TimeVal ASC
) PageReversed
ORDER BY TimeVal DESC
) PageVals
ORDER BY TimeVal ASC
As an example
SELECT *
FROM (
SELECT TOP (#EndVal - #StartVal + 1)
*
FROM (
SELECT TOP (#EndVal)
l.id,
l.timestamp
FROM log l, logfile lf
WHERE lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#"
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
order by l.timestamp asc
) PageReversed ORDER BY timestamp DESC
) PageVals
ORDER BY timestamp ASC
