We have a table that holds all email messages, both those still waiting to be sent and those already sent. The table contains over 1 million rows.
Below is the query that finds the messages which still need to be sent. After 5 errors a message is no longer attempted and needs to be fixed manually. SentDate remains NULL until the message is sent.
SELECT TOP (15)
ID,
FromEmailAddress,
FromEmailDisplayName,
ReplyToEmailAddress,
ToEmailAddresses,
CCEmailAddresses,
BCCEmailAddresses,
[Subject],
Body,
AttachmentUrl
FROM sysEmailMessage
WHERE ErrorCount < 5
AND SentDate IS NULL
ORDER BY CreatedDate
The query is slow, which I assumed was due to missing indexes. I fed the query to the Database Engine Tuning Advisor. It suggests the index below (and some statistics, which I generally ignore):
SET ANSI_PADDING ON
CREATE NONCLUSTERED INDEX [_dta_index_sysEmailMessage_7_1703677117__K14_K1_K12_5_6_7_8_9_10_11_15_17_18] ON [dbo].[sysEmailMessage]
(
[SentDate] ASC,
[ID] ASC,
[ErrorCount] ASC
)
INCLUDE ( [FromEmailAddress],
[ToEmailAddresses],
[CCEmailAddresses],
[BCCEmailAddresses],
[Subject],
[Body],
[AttachmentUrl],
[CreatedDate],
[FromEmailDisplayName],
[ReplyToEmailAddress]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
(On a side note: this index has a suggested size of 5,850,573 KB (?), which is nearly 6 GB and doesn't make any sense to me at all.)
My question is: does this suggested index make any sense? Why, for example, is the ID column part of the key, while it's not needed in the query (as far as I can tell)?
As far as my knowledge of indexes goes, they are meant to be a fast lookup to find the relevant rows. If I had to design the index myself I would come up with something like:
SET ANSI_PADDING ON
CREATE NONCLUSTERED INDEX [index_alternative_a] ON [dbo].[sysEmailMessage]
(
[SentDate] ASC,
[ErrorCount] ASC
)
WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
Is the optimizer really clever or is my index more efficient and probably better?
There are two different aspects to selecting an index: the fields you need to find the rows (the actual indexed fields), and the fields that are needed after that (the included fields). If you're always taking only the top 15 rows, you can ignore included fields entirely, because 15 key lookups will be fast -- and adding the whole email body to the index would make it huge.
For the indexed fields, it's quite important to know what percentage of the data matches your criteria.
Assuming almost all of your rows have ErrorCount < 5, you should not have it in the index -- but if matching rows are rare, then it's good to have.
Assuming SentDate is only rarely NULL, you should have that as the first column of the index.
Having CreatedDate in the index depends on how many rows, on average, are found with the ErrorCount and SentDate criteria. If it is a lot (thousands), it might help to have it there so the first 15 by CreatedDate can be found quickly.
But like always, several things affect the performance so you should test how different options affect your environment.
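To make that concrete, one possible shape for this particular query is a filtered index (this is my own sketch, not something the answer above prescribes): it only contains the unsent, still-retryable rows, keyed on CreatedDate, so the TOP (15) ... ORDER BY CreatedDate becomes a short ordered scan of the filtered rows plus 15 key lookups.
-- Sketch only: a filtered index covering just the rows the query cares about.
CREATE NONCLUSTERED INDEX IX_sysEmailMessage_Unsent
ON dbo.sysEmailMessage (CreatedDate)
WHERE SentDate IS NULL AND ErrorCount < 5;
The trade-off is that the filter predicate has to match the query's literals, so a parameterised ErrorCount threshold would not be able to use it.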
I have created a table with the following columns. Together the columns form a unique key; there is no primary key on the table.
Table Product:
Bat_Key,
product_no,
value,
pgm_name,
status,
industry,
created_by,
created_date
I have altered my table to add constraints
ALTER TABLE [dbo].[Product]
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY NONCLUSTERED ([Bat_Key] ASC, [product_no] ASC,
[value] ASC, [pgm_name] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
GO
And if I created indexes as below:
CREATE NONCLUSTERED INDEX [PRODUCT_BKEY_PNO_IDX]
ON [dbo].[PRODUCT] ([Bat_Key] ASC, [product_no] ASC, [value], [pgm_name])
INCLUDE ([status], [industry])
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
GO
Is this design good for the following select queries:
select *
from Product
where Bat_Key = ? and product_no=?
order by product_no, pgm_name;
select *
from Product
where Bat_Key = ? and product_no=? and pgm_name = ? and value = ?
select *
from Product
where Bat_Key = ? and product_no=?
delete from Product
where Bat_Key = ? and product_no=?
or should I create different indexes based on my where clauses?
A clustered index is very different from a non-clustered index. Effectively, both types of index contain the data sorted according to the columns you specify. However,
The clustered index also contains the rest of the data in the table (except for a few things like nvarchar(max)). You can consider this to be how it's saved in the database
Non-clustered indexes only contain the columns you have included in the index
If you don't have a clustered index, you have a 'heap'. Instead of a PK, heaps have row identifiers (RIDs) built in.
In your case, as your primary key is non-clustered, it makes no sense to make another index with the same fields. To read the data, it must get the row identifier(s) from your PK, then go and read the data from the heap.
If, on the other hand, your primary key is clustered (which is the default), having a non-clustered index on the fields can be useful in some circumstances. But note that every non-clustered index you add can also slow down updates, inserts and deletes (as the indexes must be maintained as well).
In your example - say you had a field there which was a varchar(8000) on the row which contains a lot of information. To even read one row from the clustered index, it must read (say) 100 bytes from the other fields, and up to 8000 bytes from that new field. In other words, it multiplies the amount you need to read by 80x.
I tend to see tables as having two types of data:
Data you aggregate
Data you only care about on a row-by-row level
For example, in a transaction table, you may have transaction_id, transaction_date, transaction_amount, transaction_description, transaction_entered_by_user_id.
In most cases, whenever you're getting totals etc, you'll frequently need transaction amounts, date when looking at totals (e.g., what was the total of transactions this week?)
On the other hand, the description and user_id are only used when you refer to specific rows (e.g., who did this specific transaction?)
In these cases, I often put a non-clustered index on the fields used in aggregation, even if they overlap with the clustered index. It just reduces the amounts of reads required.
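For example (a sketch using the hypothetical transaction table above; the table and index names are made up):
-- Narrow non-clustered index for aggregation queries; it overlaps the
-- clustered key but avoids reading wide columns like the description.
CREATE NONCLUSTERED INDEX IX_Transactions_Date_Amount
ON dbo.Transactions (transaction_date)
INCLUDE (transaction_amount);

-- A weekly total can then be answered from this index alone:
SELECT SUM(transaction_amount)
FROM dbo.Transactions
WHERE transaction_date >= '2024-01-01' AND transaction_date < '2024-01-08';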
A really good video on this is by Brent Ozar, called How to Think Like the SQL Server Engine - I strongly recommend it as it helped me a lot in understanding how indexes are used.
Regarding your specific examples - there are two things to look for in indexes:
The ability to 'seek' to a specific point in the data set (based on the sort order of the index).
The ability to reduce the amount of data that needs to be read.
In terms of allowing seeks, you need to sort the index in the most appropriate way. When doing it for filtering (e.g., WHERE clauses, JOINs), one rule of thumb is to first look for 'exact' matches. For these, it doesn't matter what order they are in, as long as the index has all of them up to that point.
In your case, you have
where Bat_Key = ? and product_no=?
where Bat_Key = ? and product_no=? and pgm_name = ? and value = ?
This suggests your first two fields should be Bat_Key and product_no (in either order). Then you can also have pgm_name and value (also in either order).
You also have
where Bat_Key = ? and product_no=?
order by product_no, pgm_name;
which suggests to me that the third field should be pgm_name (as an index on Bat_Key, product_no and pgm_name would provide what you need there).
However - and this is a big however - you have lots of *s in there e.g.,
select *
from Product
where Bat_Key = ? and product_no=?
Because you are selecting *, any index that is not the clustered index needs to also go back to the actual rows to get the rest of what's included in the *.
As these want all the fields from the table (not just the ones in the index) it will need to go back to the heap (in your case). If you had a clustered index on the fields above, as well as a non-clustered index, it would have to read from the clustered index anyway because information is in there that is needed for your query.
Once again - the video above - explains this much better than I do.
Therefore, in your case, I suggest the following Primary Key
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY CLUSTERED ([Bat_Key] ASC, [product_no] ASC,
[pgm_name] ASC, [value] ASC)
Differences
It is clustered rather than non-clustered
The order of the 3rd and 4th fields is rearranged to help with the ORDER BY pgm_name
No real need for a second non-clustered index, as there is not much other data to be read.
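Applying that change means dropping the existing nonclustered primary key first, since a table can only have one primary key. Roughly (a sketch using the names from the question; rebuilding the clustered structure rewrites the whole table, so test it on a copy first):
ALTER TABLE [dbo].[Product] DROP CONSTRAINT [PRODUCT_PK];

ALTER TABLE [dbo].[Product]
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY CLUSTERED ([Bat_Key] ASC, [product_no] ASC, [pgm_name] ASC, [value] ASC);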
I'm trying to optimize a table lookup because the execution plan shows a pretty hefty parallelized table scan. The table is called Opportunity and the column I'm filtering on is Name. Specifically I want all rows that don't have "Supplement" as part of the Name:
WHERE ([Name] NOT LIKE '%Supplement%');
I was looking around for a way to optimize this and came across filtered indexes, which are what I need, but they don't seem to like the LIKE keyword. Is there an alternative way to create a filtered index like this?
The table has ~53k rows, and when directly querying the server it takes 4 seconds to get the data, but when I query it through a linked server (which is what I need) it takes 2 minutes. In an attempt to improve this time, I moved the query out of my script that was talking to the linked server and created a view on the remote server. It still takes forever.
Here's what I've tried so far, but SSMS says it's invalid:
CREATE NONCLUSTERED INDEX [FX_NotSupplementOpportunities]
ON [Opportunity]([Name])
WHERE (([Name] NOT LIKE '%Supplement%')
AND ([Name] NOT LIKE '%Suplement%')
AND ([Name] NOT LIKE '%Supplament%')
AND ([Name] NOT LIKE '%Suppliment%'));
Thanks in advance for any suggestions!
You might use an index on a computed column.
An example would be:
CREATE TABLE [dbo].[MyTab](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Text] [varchar](max) NULL,
[OK] AS (case when NOT [text] like '%abc%' then (1) else (0) end) PERSISTED NOT NULL,
CONSTRAINT [PK_MyTab] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [idx_OK] ON [dbo].[MyTab]
(
[OK] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Unfortunately, there are many limitations on what you can put in a condition of a filtered index.
I can't find anything specific in MSDN but this blog post by Brent Ozar: What You Can (and Can’t) Do With Filtered Indexes mentions several of the limitations:
You can't use BETWEEN, NOT IN, CASE expressions, or OR.
They don't mention LIKE, but simple testing (as you did) confirms that you can't use it either. You can't even use (NOT (a >= 7)), even though it could be rewritten as the allowed (a < 7).
One thought would be to use a CASE expression in a persisted computed column and then use that column in the filtered index - but computed columns can't be used in filtered indexes either; that's another of their limitations!
So, what can you do? The only thing that comes to mind is to create a persisted computed column and use it in a simple (not filtered) index. Something like:
ALTER TABLE dbo.Opportunity
ADD special_condition AS (CASE WHEN [Name] NOT LIKE '%Supplement%'
THEN 1 ELSE 0 END)
PERSISTED;
Then add an index, using the column:
CREATE NONCLUSTERED INDEX FX_NotSupplementOpportunities
ON dbo.Opportunity
(special_condition, [Name]) ;
and use the (WHERE special_condition = 1) in your queries.
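For instance, a query sketch (assuming Name is all you need back):
SELECT [Name]
FROM dbo.Opportunity
WHERE special_condition = 1;
The optimizer can then seek on special_condition rather than scanning every Name value.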
I have chosen @bummi's answer because it was the closest to what I attempted, but it wasn't what I ended up using. A little explanation...
So, after many hours of trying to figure out how to make the query look-up faster I actually got it from a bunch of parallelized chunks down to two index seeks. I was super ecstatic about that, but in the end I had to scrap it. The problem is that the remote database is actually a backup of our Salesforce data. I had to go through some very complex table and column alterations pre and post sync that just didn't work correctly and would get erased each sync (every 10 minutes).
While doing all of that it eventually hit me that I was importing the data and then further formatting it again on my end. I decided instead to update the views on the remote server and format the data there as much as possible and then import it. So, I spent a couple of hours re-writing the SQL on both sides and I managed to get the ~25min script down to ~3min, which makes me very happy and satisfied.
In the end, although the query look-ups on the remote server are unoptimized they're still quite fast, mostly because on average there's no more than ~50k rows in most of the tables I'm touching...
I've got a table that contains some buy/sell data, with around 8M records in it:
CREATE TABLE [dbo].[Transactions](
[id] [int] IDENTITY(1,1) NOT NULL,
[itemId] [bigint] NOT NULL,
[dt] [datetime] NOT NULL,
[count] [int] NOT NULL,
[price] [float] NOT NULL,
[platform] [char](1) NOT NULL
) ON [PRIMARY]
Every X minutes my program gets new transactions for each itemId and I need to update the table. My first solution is a two-step DELETE+INSERT:
delete from Transactions where platform=@platform and itemid=@itemid
insert into Transactions (platform,itemid,dt,count,price) values (@platform,@itemid,@dt,@count,@price)
[...]
insert into Transactions (platform,itemid,dt,count,price) values (@platform,@itemid,@dt,@count,@price)
The problem is that this DELETE statement takes 5 seconds on average. That's much too long.
The second solution I found is to use MERGE. I've created a stored procedure which takes a table-valued parameter:
CREATE PROCEDURE [dbo].[sp_updateTransactions]
@Table dbo.tp_Transactions readonly,
@itemId bigint,
@platform char(1)
AS
BEGIN
MERGE Transactions AS TARGET
USING @Table AS SOURCE
ON (
TARGET.[itemId] = SOURCE.[itemId] AND
TARGET.[platform] = SOURCE.[platform] AND
TARGET.[dt] = SOURCE.[dt] AND
TARGET.[count] = SOURCE.[count] AND
TARGET.[price] = SOURCE.[price] )
WHEN NOT MATCHED BY TARGET THEN
INSERT VALUES (SOURCE.[itemId],
SOURCE.[dt],
SOURCE.[count],
SOURCE.[price],
SOURCE.[platform])
WHEN NOT MATCHED BY SOURCE AND TARGET.[itemId] = @itemId AND TARGET.[platform] = @platform THEN
DELETE;
END
This procedure takes around 7 seconds with a table of 70k records, so with 8M it would probably take a few minutes. The bottleneck is the "WHEN NOT MATCHED" clause - when I commented that line out, the procedure ran in 0.01 seconds on average.
So the question is: how to improve perfomance of the delete statement?
The delete is needed to make sure that the table doesn't contain transactions that have been removed in the application. In a real scenario this happens really rarely, and the true need to delete records is less than 1 in 10,000 transaction updates.
My theoretical workaround is to create an additional column like "transactionDeleted bit", use UPDATE instead of DELETE, and then clean up the table with a batch job every X minutes or hours that executes
delete from transactions where transactionDeleted=1
It should be faster, but I would need to update all SELECT statements in other parts of the application to use only transactionDeleted=0 records, so it may also affect application performance.
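Roughly, that workaround would look like this (a sketch; the column and constraint names are mine):
-- Soft-delete flag instead of physical DELETE:
ALTER TABLE dbo.Transactions
    ADD transactionDeleted bit NOT NULL
        CONSTRAINT DF_Transactions_Deleted DEFAULT (0);

-- Instead of the DELETE:
UPDATE Transactions
SET transactionDeleted = 1
WHERE platform = @platform AND itemid = @itemid;

-- Periodic cleanup job:
DELETE FROM Transactions WHERE transactionDeleted = 1;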
Do you know any better solution?
UPDATE: Current indexes:
CREATE NONCLUSTERED INDEX [IX1] ON [dbo].[Transactions]
(
[platform] ASC,
[ItemId] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 50) ON [PRIMARY]
ALTER TABLE [dbo].[Transactions] ADD CONSTRAINT [IX2] UNIQUE NONCLUSTERED
(
[ItemId] DESC,
[count] ASC,
[dt] DESC,
[platform] ASC,
[price] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
OK, here is another approach. For a similar problem (a large scan by WHEN NOT MATCHED BY SOURCE then DELETE) I reduced the MERGE execution time from 806ms to 6ms!
One issue with the problem above is that the "WHEN NOT MATCHED BY SOURCE" clause is scanning the whole TARGET table.
It is not that obvious but Microsoft allows the TARGET table to be filtered (by using a CTE) BEFORE doing the merge. So in my case the TARGET rows were reduced from 250K to less than 10 rows. BIG difference.
Assuming that the above problem works with the TARGET being filtered by @itemid and @platform, the MERGE code would look like this. The index changes suggested above would help this logic too.
WITH Transactions_CTE (itemId
,dt
,count
,price
,platform
)
AS
-- Define the CTE query that will reduce the size of the TARGET table.
(
SELECT itemId
,dt
,count
,price
,platform
FROM Transactions
WHERE itemId = @itemId
AND platform = @platform
)
MERGE Transactions_CTE AS TARGET
USING @Table AS SOURCE
ON (
TARGET.[itemId] = SOURCE.[itemId]
AND TARGET.[platform] = SOURCE.[platform]
AND TARGET.[dt] = SOURCE.[dt]
AND TARGET.[count] = SOURCE.[count]
AND TARGET.[price] = SOURCE.[price]
)
WHEN NOT MATCHED BY TARGET THEN
INSERT
VALUES (
SOURCE.[itemId]
,SOURCE.[dt]
,SOURCE.[count]
,SOURCE.[price]
,SOURCE.[platform]
)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
Using a BIT field for IsDeleted (or IsActive as many people do) is valid but it does require modifying all code plus creating a separate SQL Job to periodically come through and remove the "deleted" records. This might be the way to go but there is something less intrusive to try first.
I noticed in your set of 2 indexes that neither is CLUSTERED. Can I assume that the IDENTITY field is? You might consider making the [IX2] UNIQUE index the CLUSTERED one and changing the PK (again, I assume the IDENTITY field is a CLUSTERED PK) to be NONCLUSTERED. I would also reorder the IX2 fields to put [Platform] and [ItemID] first. Since your main operation is looking for [Platform] and [ItemID] as a set, physically ordering them this way might help. And since this index is unique, that is a good candidate for being CLUSTERED. It is certainly worth testing as this will impact all queries against the table.
Also, if changing the indexes as I have suggested helps, it still might be worth trying both ideas and hence doing the IsDeleted field as well to see if that increases performance even more.
EDIT:
I forgot to mention, by making the IX2 index CLUSTERED and moving the [Platform] field to the top, you should get rid of the IX1 index.
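A sketch of that cleanup step, using the index name from the question:
DROP INDEX [IX1] ON [dbo].[Transactions];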
EDIT2:
Just to be very clear, I am suggesting something like:
CREATE UNIQUE CLUSTERED INDEX [IX2] ON [dbo].[Transactions]
(
[ItemId] DESC,
[platform] ASC,
[count] ASC,
[dt] DESC,
[price] ASC
)
And to be fair, changing which index is CLUSTERED could also negatively impact queries where JOINs are done on the [id] field which is why you need to test thoroughly. In the end you need to tune the system for your most frequent and/or expensive queries and might have to accept that some queries will be slower as a result but that might be worth this operation being much faster.
See this https://stackoverflow.com/questions/3685141/how-to-....
would the update be the same cost as a delete? No. The update would be a much lighter operation, especially if you had an index on the PK (errrr, that's a guid, not an int). The point being that an update to a bit field is much less expensive. A (mass) delete would force a reshuffle of the data.
In light of this information, your idea to use a bit field is very valid.
Imagine we have a set of entities each of which has its state: free, busy or broken. The state is specified for a day, for example, today on 2011-05-17 an entity E1 is free and tomorrow on 2011-05-18 it is busy.
There is a need to store ~10^5 entities for 1000 days. What is the best way to do so?
I am thinking about 2 options:
represent each day as a character "0", "1" or "2" and store for every entity a string of 1000 characters
store each day with entity's state in a row, i.e. 1000 rows for an entity
The most important query for such data is: given start date and end date identify which entities are free.
Performance is of higher priority than storage.
All suggestions and comments are welcome.
The best way is to try the simpler and more flexible option first (that is, store each day in its own row) and only devise a sophisticated alternative method if the performance is unsatisfactory. Avoid premature optimization.
10^8 rows isn't such a big deal for your average database on a commodity server nowadays. Put an index on the date, and I would bet that range queries ("given start date and end date...") will work just fine.
The reasons I claim that this is both simpler and more flexible than the idea of storing a string of 1000 characters are:
You'll have to process this in code, and that code would not be as straightforward to understand as code that queries DB records that contain date and status.
Depending on the database engine, 1000 character strings may be blobs that are stored outside of the record. That makes them less efficient.
What happens if you suddenly need 2,000 days instead of 1,000? Start updating all the rows and the code that processes them? That's much more work than just changing your query.
What happens when you're next asked to store some additional information per daily record, or need to change the granularity (move from days to hours for example)?
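As a rough sketch of the row-per-day option (all names, types, and the state encoding below are assumptions, not from the question):
-- One row per entity per day.
CREATE TABLE dbo.EntityState (
    EntityId  int     NOT NULL,
    StateDate date    NOT NULL,
    State     tinyint NOT NULL,  -- 0 = free, 1 = busy, 2 = broken
    CONSTRAINT PK_EntityState PRIMARY KEY CLUSTERED (StateDate, EntityId)
);

-- "Which entities are free for the whole range?"
DECLARE @StartDate date = '2011-05-17', @EndDate date = '2011-05-23';

SELECT EntityId
FROM dbo.EntityState
WHERE StateDate BETWEEN @StartDate AND @EndDate
GROUP BY EntityId
HAVING COUNT(*) = DATEDIFF(day, @StartDate, @EndDate) + 1  -- every day present
   AND MAX(State) = 0;                                     -- and every day free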
Create a single table to hold your data. Create the table with an ID, Date, Entity name and eight boolean fields. SQL Server 2008 gave me the code below for the table:
CREATE TABLE [dbo].[EntityAvailability](
[EA_Id] [int] IDENTITY(1,1) NOT NULL,
[EA_Date] [date] NOT NULL,
[EA_Entity] [nchar](10) NOT NULL,
[EA_IsAvailable] [bit] NOT NULL,
[EA_IsUnAvailable] [bit] NOT NULL,
[EA_IsBroken] [bit] NOT NULL,
[EA_IsLost] [bit] NOT NULL,
[EA_IsSpare1] [bit] NOT NULL,
[EA_IsSpare2] [bit] NOT NULL,
[EA_IsSpare3] [bit] NOT NULL,
[EA_IsActive] [bit] NOT NULL,
CONSTRAINT [IX_EntityAvailability_Id] UNIQUE NONCLUSTERED
(
[EA_Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[EntityAvailability]') AND name = N'IXC_EntityAvailability_Date')
CREATE CLUSTERED INDEX [IXC_EntityAvailability_Date] ON [dbo].[EntityAvailability]
(
[EA_Date] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
The clustered index on date will perform best for your range searches. Never allow searches without a date range and there will be no need for any index other than the clustered index. The boolean fields allow eight situations using only a single byte. The row size for this table is 35 bytes, so 230 rows will fit on a page. You stated you needed to store 10^5 entities for 1000 days, which is 100 million rows. One hundred million rows will occupy 434,782 8K pages, or around 3 GB.
Install the table on an SSD and you are set to go.
Depending on whether entities are more often free or not, just store the dates when an entity is free, or the dates when it is not, whichever is rarer.
Assuming you store the periods when the entity is not free, then the search is WHERE start_date <= date AND end_date >= date; any row matching that means the entity is not free for that period.
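Under that model, "which entities are free for a whole range?" becomes an anti-join; a sketch, with the table and column names assumed:
-- Assumed schema: EntityBusy(EntityId, StartDate, EndDate) holds only the
-- periods when an entity is NOT free.
DECLARE @StartDate date = '2011-05-17', @EndDate date = '2011-05-23';

SELECT e.EntityId
FROM dbo.Entity AS e
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.EntityBusy AS b
    WHERE b.EntityId  = e.EntityId
      AND b.StartDate <= @EndDate
      AND b.EndDate   >= @StartDate  -- overlaps the requested range
);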
It sounds like you might be on the right track and I would suggest because of the sheer number of records and the emphasis on performance that you keep the schema as denormalized as possible. The fewer joins you need to make to determine the free or busy entities the better.
I would broadly go for a Kimball Star Schema (http://en.wikipedia.org/wiki/Star_schema) type structure with three tables (initially)
FactEntity (FK kStatus, kDate)
DimStatus (PK kStatus)
DimDate (PK kDate)
This can be loaded quite simply (Dims first followed by Fact(s)), and queried also very simply. Performance can be optimised by suitable indexing.
A big advantage of this design is that it is very extensible; if you want to increase the date range, or increase the number of valid states it is trivial to extend.
Other dimensions could sensibly be added, e.g. DimEntity, which could have richer categorical information that might be interesting for slicing and dicing your Entities.
The DimDate is normally enriched by adding DayNo, MonthNo, YearNo, DayOfWeek, WeekendFlag, WeekdayFlag, PublicHolidayFlag. These allow some very interesting analyses to be performed.
As @Elad asks, what would happen if you added time-based information? This too can be incorporated, by a DimTime dimension having one record per hour or minute.
Apologies for my naming, as I don't have a good understanding of your data. Given more time I could come up with some better ones!
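A minimal DDL sketch of that shape (the key names follow the answer; every other column, and DimEntity's inclusion in the fact table, are my assumptions):
CREATE TABLE dbo.DimDate (
    kDate date NOT NULL PRIMARY KEY,
    DayNo int, MonthNo int, YearNo int, DayOfWeek int,
    WeekendFlag bit, PublicHolidayFlag bit
);
CREATE TABLE dbo.DimStatus (
    kStatus tinyint NOT NULL PRIMARY KEY,
    StatusName varchar(20) NOT NULL  -- 'free', 'busy', 'broken'
);
CREATE TABLE dbo.DimEntity (
    kEntity int NOT NULL PRIMARY KEY,
    EntityName varchar(50) NOT NULL
);
CREATE TABLE dbo.FactEntity (
    kEntity int     NOT NULL REFERENCES dbo.DimEntity (kEntity),
    kStatus tinyint NOT NULL REFERENCES dbo.DimStatus (kStatus),
    kDate   date    NOT NULL REFERENCES dbo.DimDate (kDate),
    CONSTRAINT PK_FactEntity PRIMARY KEY CLUSTERED (kDate, kEntity)
);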
To get free entities on a date, you may try:
select
e.EntityName
, s.StateName
, x.ValidFrom
from EntityState as x
join Entity as e on e.EntityId = x.EntityId
join State as s on s.StateID = x.StateID
where StateName = 'free'
and x.ValidFrom = ( select max(z.ValidFrom)
from EntityState as z
where z.EntityID = x.EntityID
and z.ValidFrom <= your_date_here )
;
Note: Make sure you store only state changes in EntityState table.
I'm experiencing massive slowness when accessing one of my tables and I need some re-factoring advice. Sorry if this is not the correct area for this sort of thing.
I'm working on a project that aims to report on server performance statistics for our internal servers. I'm processing windows performance logs every night (12 servers, 10 performance counters and logging every 15 seconds). I'm storing the data in a table as follows:
CREATE TABLE [dbo].[log](
[id] [int] IDENTITY(1,1) NOT NULL,
[logfile_id] [int] NOT NULL,
[test_id] [int] NOT NULL,
[timestamp] [datetime] NOT NULL,
[value] [float] NOT NULL,
CONSTRAINT [PK_log] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH FILLFACTOR = 90 ON [PRIMARY]
) ON [PRIMARY]
There are currently 16,529,131 rows and it will keep on growing.
I access the data to produce reports and create graphs from ColdFusion like so:
SET NOCOUNT ON
CREATE TABLE ##RowNumber ( RowNumber int IDENTITY (1, 1), log_id char(9) )
INSERT ##RowNumber (log_id)
SELECT l.id
FROM log l, logfile lf
WHERE lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#"
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
order by l.timestamp asc
select rn.RowNumber, l.value, l.timestamp
from log l, logfile lf, ##RowNumber rn
where lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#
and l.logfile_id = lf.id
and rn.log_id = l.id
and ((rn.rownumber % #modu# = 0) or (rn.rownumber = 1))
order by l.timestamp asc
DROP TABLE ##RowNumber
SET NOCOUNT OFF
(For non-CF devs: #value# inserts the value and ## maps to #.)
I basically create a temporary table so that I can use the row number to select every x rows. This way I'm only selecting the number of rows I can display. It helps, but it's still very slow.
SQL Server Management Studio tells me my indexes are as follows (I have pretty much no knowledge about using indexes properly):
IX_logfile_id (Non-Unique, Non-Clustered)
IX_test_id (Non-Unique, Non-Clustered)
IX_timestamp (Non-Unique, Non-Clustered)
PK_log (Clustered)
I would be very grateful to anyone who could give some advice that could help me speed things up a bit. I don't mind re-organising things and I have complete control of the project (perhaps not over the server hardware though).
Cheers (sorry for the long post)
Your problem is that you chose a bad clustered key. Nobody is ever interested in retrieving one particular log value by ID. If your system is like anything else I've seen, then all queries are going to ask for:
all counters for all servers over a range of dates
specific counter values over all servers for a range of dates
all counters for one server over a range of dates
specific counter for specific server over a range of dates
Given the size of the table, all your non-clustered indexes are useless. They are all going to hit the index tipping point, guaranteed, so they might just as well not exist. I assume all your non-clustered indexes are defined as a simple index over the field in the name, with no included fields.
I'm going to pretend I actually know your requirements. You must forget common sense about storage and actually duplicate all your data in every non-clustered index. Here is my advice:
Drop the clustered index on [id]; it is as useless as it gets.
Organize the table with a clustered index on (logfile_id, test_id, timestamp).
Non-clustered index on (test_id, logfile_id, timestamp) include (value)
NC index on (logfile_id, timestamp) include (value)
NC index on (test_id, timestamp) include (value)
NC index on (timestamp) include (value)
Add maintenance tasks to reorganize all indexes periodically as they are prone to fragmentation
The clustered index covers the query 'history of specific counter value at a specific machine'. The non clustered indexes cover various other possible queries (all counters at a machine over time, specific counter across all machines over time etc).
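Translated into T-SQL, that example structure might look roughly like the sketch below (an illustration of the idea only, per the caveat that follows; the index names are mine):
-- Assumes the PK_log clustered primary key has first been dropped or
-- re-created as a NONCLUSTERED constraint, so the table can be
-- re-clustered on the access-pattern columns.
CREATE CLUSTERED INDEX IXC_log
    ON dbo.[log] (logfile_id, test_id, [timestamp]);

-- Covering non-clustered indexes for the other query shapes:
CREATE NONCLUSTERED INDEX IX_log_test_logfile_ts
    ON dbo.[log] (test_id, logfile_id, [timestamp]) INCLUDE (value);
CREATE NONCLUSTERED INDEX IX_log_logfile_ts
    ON dbo.[log] (logfile_id, [timestamp]) INCLUDE (value);
CREATE NONCLUSTERED INDEX IX_log_test_ts
    ON dbo.[log] (test_id, [timestamp]) INCLUDE (value);
CREATE NONCLUSTERED INDEX IX_log_ts
    ON dbo.[log] ([timestamp]) INCLUDE (value);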
You notice I did not comment anything about your query script. That is because there isn't anything in the world you can do to make the queries run faster over the table structure you have.
Now one thing you shouldn't do is actually implement my advice. I said I'm going to pretend I know your requirements. But I actually don't. I just gave an example of a possible structure. What you really should do is study the topic and figure out the correct index structure for your requirements:
General Index Design Guidelines.
Index Design Basics
Index with Included Columns
Query Types and Indexes
Also a google on 'covering index' will bring up a lot of good articles.
And of course, at the end of the day storage is not free, so you'll have to balance the requirement to have a non-clustered index on every possible combination with the need to keep the size of the database in check. Luckily you have a very small and narrow table, so duplicating it over many non-clustered indexes is no big deal. Also, I wouldn't be concerned about insert performance: 120 counters at 15-second intervals means 8-9 inserts per second, which is nothing.
A couple of things come to mind.
Do you need to keep that much data? If not, consider creating an archive table if you want to keep the history (but don't create it just to join it with the primary table every time you run a query).
I would avoid using a temp table with so much data. See this article on temp table performance and how to avoid using them.
http://www.sql-server-performance.com/articles/per/derived_temp_tables_p1.aspx
It looks like you are missing an index on the server_id field. I would consider creating a covered index using this field and others. Here is an article on that as well.
http://www.sql-server-performance.com/tips/covering_indexes_p1.aspx
Edit
With that many rows in the table over such a short time frame, I would also check the indexes for fragmentation which may be a cause for slowness. In SQL Server 2000 you can use the DBCC SHOWCONTIG command.
See this link for info http://technet.microsoft.com/en-us/library/cc966523.aspx
Once, when still working with SQL Server 2000, I needed to do some paging, and I came across a method of paging that really blew my mind. Have a look at this method.
DECLARE @Table TABLE(
TimeVal DATETIME
)
DECLARE @StartVal INT
DECLARE @EndVal INT
SELECT @StartVal = 51, @EndVal = 100
SELECT *
FROM (
SELECT TOP (@EndVal - @StartVal + 1)
*
FROM (
--select up to end number
SELECT TOP (@EndVal)
*
FROM #Table
ORDER BY TimeVal ASC
) PageReversed
ORDER BY TimeVal DESC
) PageVals
ORDER BY TimeVal ASC
As an example
SELECT *
FROM (
SELECT TOP (@EndVal - @StartVal + 1)
*
FROM (
SELECT TOP (@EndVal)
l.id,
l.timestamp
FROM log l, logfile lf
WHERE lf.server_id = #arguments.server_id#
and l.test_id = #arguments.test_id#"
and l.timestamp >= #arguments.report_from#
and l.timestamp < #arguments.report_to#
and l.logfile_id = lf.id
order by l.timestamp asc
) PageReversed ORDER BY timestamp DESC
) PageVals
ORDER BY timestamp ASC