We have a database where logs about customer activity are saved every second.
Meanwhile, the dashboard site selects recent data from this database every second, so hundreds of insert and select queries are executed every second.
Should I use indexes on the database to reduce select statement execution time?
I would avoid indexes.
If you are logging in real time, every user action that gets logged requires an insert statement. Indexes slow those inserts down substantially, from roughly O(1) to O(log n), where n is the number of rows in the table. If your application is synchronous and single-threaded, this will be bad.
I would ask myself these questions:
Do I need real-time dashboards?
If so, can I move logging to a separate thread or a separate server so the slower insert speed isn't an issue?
Without indexes your selects are going to become slower and slower over time as more data needs to be scanned. This is going to be unacceptable. Eventually you'll be forced to index anyway.
Add indexes.
Probably not. I used to be firmly in the school of thought that (almost) every table should have a clustered index (whether every table should have one has become a religious debate for some people). But after some very painful issues I have discovered that while a clustered index is very useful for selects, it can (will?) slow down insertions so much that it hurts the application.
You probably want to test with and without a clustered index to see whether this happens. I have also noticed that at a certain point (between 8 and 12 indexes on a table) insertions can really slow down, so even if you decide to go with indexes (which can be a fine idea), too many of them can slow down your app. Test and monitor your app carefully.
You can get the best of both worlds (no indexes on the log table so it stays fast, and a heavily indexed table for reporting) by using CDC (change data capture) on the log table. This is a low-impact way of copying all changes to another table, which can carry as many indexes as you like for fast selects, with very little impact on the log table.
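As a rough sketch (the table name dbo.ActivityLog is just a placeholder), enabling CDC looks something like this:

    -- Enable CDC at the database level, then on the log table.
    -- Note: CDC relies on SQL Server Agent jobs to harvest the changes.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'ActivityLog',
        @role_name     = NULL;   -- no gating role in this sketch

    -- Reporting can then read (or copy out of) the generated change table,
    -- and that side can be indexed as heavily as you like.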
I'm new to SQL Server, trying to optimize a procedure I received from an ex-colleague (and I can't ask him).
At the final step, the procedure updates a large table using a MERGE statement. After that, it drops two nonclustered indexes and creates them again. What is the purpose of doing that? Aren't the optimizer's statistics recollected regularly anyway? Is recreating indexes the only way to give the optimizer fresh statistics?
Thanks
After that, it drops two nonclustered indexes and creates them again. What is the purpose of doing that?
To compact them into less space, without the page splits that may have happened during the MERGE. Generally it is NOT needed - like, at all. It MAY make sense, but it is much better to actually analyze the index statistics about page splits and fragmentation before doing that, unless you can be sure it is beneficial on every load.
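One way to make that call (a sketch; the table name is a placeholder) is to check fragmentation first and only rebuild when it is actually high:

    -- How fragmented are the indexes on the merge target?
    SELECT i.name AS index_name,
           ps.avg_fragmentation_in_percent,
           ps.page_count
    FROM   sys.dm_db_index_physical_stats(
               DB_ID(), OBJECT_ID('dbo.TargetTable'), NULL, NULL, 'LIMITED') AS ps
    JOIN   sys.indexes AS i
           ON i.object_id = ps.object_id
          AND i.index_id  = ps.index_id;

The usual rule of thumb is to reorganize somewhere above ~10% fragmentation and rebuild above ~30%, and to ignore very small indexes entirely.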
Aren't collected statistics for optimizer being recollected regularly?
They are, but reorganizing the indexes may make them more efficient. As data changes, the data in index pages changes, and when a page overflows it is split. Over time this leaves the index (not the statistics on it) fragmented, which may lead to additional IO load.
Is recreating indexes the only way to provide optimizer with fresh statistics?
No. But you do not rebuild indexes for statistics in the first place; if statistics are what you want, you can just update them. You rebuild to get an efficient index.
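For example (placeholder object names), the two concerns are handled by different statements:

    -- Fresh statistics without touching the index structure
    UPDATE STATISTICS dbo.TargetTable;

    -- Defragment the index itself; REBUILD also refreshes its statistics
    -- with a full scan, while REORGANIZE leaves statistics alone
    ALTER INDEX IX_TargetTable_Col ON dbo.TargetTable REBUILD;
    -- or, lighter weight:
    ALTER INDEX IX_TargetTable_Col ON dbo.TargetTable REORGANIZE;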
I have noticed an interesting performance change that happens at around 1.5 million inserted values. Can someone give me a good explanation of why this is happening?
The table is very simple. It consists of (bigint, bigint, bigint, bit, varbinary(max)).
I have a clustered primary key index on the first three bigints. The only data I insert into the varbinary(max) column is the boolean value "true".
From that point on, performance seems pretty constant.
(Graph not shown. Legend: Y = time in ms, X = inserts x 10K.)
I am also curious about the constant, relatively small (sometimes very large) spikes I have on the graph.
Actual Execution Plan from before spikes.
Legend:
Table I am inserting into: TSMDataTable
1. BigInt DataNodeID - fk
2. BigInt TS - main timestamp
3. BigInt CTS - modification timestamp
4. Bit: ICT - keeps record of last inserted value (increases read performance)
5. Data: Data
Environment
It is local.
It is not sharing any resources.
It is fixed size database (enough so it does not expand).
(Computer: 4 cores, 8 GB RAM, 7200 RPM disk, Win 7)
(SQL Server 2008 R2 DC, processor affinity to cores 1 and 2, 3 GB)
Have you checked the execution plan once the time goes up? The plan may change depending on statistics. Since your data grow fast, stats will change and that may trigger a different execution plan.
Nested loops are good for small amounts of data, but as you can see, the time grows with volume. The SQL query optimizer then probably switches to a hash or merge plan which is consistent for large volumes of data.
To confirm this theory quickly, try disabling automatic statistics updates and run your test again. You should not see the "bump" then.
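For a quick test (the database name is a placeholder), automatic statistics updates can be switched off and back on like this:

    -- Turn off automatic statistics updates for the benchmark run
    ALTER DATABASE TestDb SET AUTO_UPDATE_STATISTICS OFF;

    -- ... rerun the insert test here ...

    -- Re-enable afterwards; you normally want this ON in production
    ALTER DATABASE TestDb SET AUTO_UPDATE_STATISTICS ON;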
EDIT: Since Falcon confirmed that performance changed due to statistics we can work out the next steps.
I guess you are inserting rows one by one, correct? In that case (if you cannot bulk insert) you'll be much better off inserting into a heap work table and then, at regular intervals, moving the rows in bulk into the target table. This is because for every inserted row SQL Server has to check for key duplicates and foreign keys, run other checks, and sort and split pages all the time. If you can afford to postpone these checks until a little later, I think you'll get superb insert performance.
I used this method for metrics logging: logging went into a plain heap table with no indexes, no foreign keys, no checks. Every ten minutes I create a new table of this kind, then with two sp_rename calls inside a transaction (a swift swap) I make the full table available for processing while the new table takes over the logging. Then you have the comfort of doing all the checking, sorting and splitting only once, in bulk.
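A minimal sketch of that swap (all object names here are hypothetical; dbo.MetricsLog is the live heap and dbo.MetricsLog_Next is an empty copy created ahead of time):

    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.MetricsLog',      'MetricsLog_Processing';
        EXEC sp_rename 'dbo.MetricsLog_Next', 'MetricsLog';
    COMMIT TRANSACTION;

    -- Now move the captured rows into the indexed target in one bulk pass
    INSERT INTO dbo.TSMDataTable WITH (TABLOCK) (DataNodeID, TS, CTS, ICT, Data)
    SELECT DataNodeID, TS, CTS, ICT, Data
    FROM   dbo.MetricsLog_Processing;

    DROP TABLE dbo.MetricsLog_Processing;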
Apart from this, I'm not sure how to improve your situation. You certainly need to update statistics regularly as that is a key to a good performance in general.
You might try using a single-column identity clustered key and an additional unique index on those three columns, but I'm doubtful it would help much.
You might try padding the indexes if your inserted data are not sequential. This would eliminate excessive page splitting, shuffling and fragmentation. You'll need to maintain the padding regularly, which may require some off-time.
You might try a hardware upgrade. You'll need to figure out which component is the bottleneck. It may be the CPU or the disk - my favourite suspect in this case. Memory is not likely, imho, if you are doing one-by-one inserts. It should be easy then: if it's not the CPU (the line hanging at the top of the graph), then it's most likely your IO holding you back. Try a better controller, better caching and a faster disk...
I have a log table that will receive inserts from several web apps. I won't be doing any searching/sorting/querying of this data; I will be pulling it out to another database to run reports. The initial table is strictly for RECEIVING the log messages.
Is there a way to ensure that the web applications don't have to wait on these inserts? For example I know that adding a lot of indexes would slow inserts, so I won't. What else is there? Should I not add a primary key? (Each night the table will be pumped to a reports DB which will have a lot of keys/indexes)
If performance is key, you may not want to write this data to a database. I think most everything will process a database write as a round-trip, but it sounds like you don't want to wait for the returned confirmation message. Check if, as S. Lott suggests, it might not be faster to just append a row to a simple text file somewhere.
If the database write is faster (or necessary, for security or other business/operational reasons), I would put no indexes on the table--and that includes a primary key. If it won't be used for reads or updates, and if you don't need relational integrity, then you just don't need a PK on this table.
To recommend the obvious: as part of the nightly reports run, clear out the contents of the table. Also, never reset the database file sizes (ye olde shrink database command); after a week or so of regular use, the database files should be as big as they'll ever need to be and you won't have to worry about the file growth performance hit.
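For example (the table name is a placeholder), the nightly job can empty the staging table with a truncate, which is fast, minimally logged, and leaves the file sizes alone:

    -- After the rows have been copied to the reporting database:
    TRUNCATE TABLE dbo.WebLogStaging;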
Here are a few ideas; note that for the last ones to matter you would need extremely high volumes:
do not have a primary key; it is enforced via an index
do not have any other index
create the database large enough that you do not have any database growth
place the database on its own disk to avoid contention
avoid software RAID
place the database on a mirrored disk; this saves the parity calculation done on RAID 5
No keys,
no constraints,
no validation,
no triggers,
No calculated columns
If you can, have the services insert asynchronously, so they don't wait for the result (if that is acceptable).
You can even try inserting into a "daily" table, which will hold fewer records,
and then move this across before the batch runs at night.
But mostly: on this table, NO keys or validation (a PK and unique indexes will kill you).
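Something like this (names are illustrative only): a bare heap with nothing that has to be checked or maintained on insert.

    CREATE TABLE dbo.AppLogStaging   -- hypothetical name
    (
        LoggedAt datetime2(3)  NOT NULL,
        AppName  varchar(50)   NOT NULL,
        Message  nvarchar(max)     NULL
    );
    -- No primary key, no unique constraints, no foreign keys, no triggers,
    -- no computed columns, no indexes: just a heap to append to.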
I am running an archive script which deletes rows from a large table (~50 million records) based on the date they were entered. The date field is the clustered index on the table, and is therefore what I apply my conditional statement to.
I am running this delete in a while loop, trying batch sizes anywhere from 1,000 to 100,000 records. Regardless of batch size, it is surprisingly slow: something like 10,000 records deleted per minute. Looking at the execution plan, a lot of time is spent on "Index Delete"s. There are about 15 fields in the table, and roughly 10 of them have some sort of index on them. Is there any way to get around this issue? I'm not even sure why each index delete takes so long; can someone shed some light on exactly what's happening here? This is a sample of my execution plan:
(Execution plan image: http://img94.imageshack.us/img94/1006/indexdelete.png)
(The Sequence points to the Delete command)
This database is live and gets inserted into often, which is why I'm hesitant to use the copy-and-truncate method of trimming the size. Are there any other options I'm missing here?
Deleting 10k records from a clustered index plus 5 nonclustered ones should definitely not take one minute. It sounds like you have a really, really slow IO subsystem. What are the values for:
Avg. Disk sec/Write
Avg. Disk sec/Read
Avg. Disk Write Queue Length
Avg. Disk Read Queue Length
on each drive involved in the operation (including the log ones!)? If you placed indexes in separate filegroups and allocated each filegroup to its own LUN or its own disk, then you can identify which indexes are the most problematic. Also, the log flush may be a major bottleneck. SQL Server doesn't have much control here; it is all in your own hands how to speed things up. That time is not spent in CPU cycles, it is spent waiting for IO to complete, and you need an IO subsystem calibrated for the load you demand.
To reduce the IO load you should look into making indexes narrower. Primarily, make sure the clustered index is the narrowest possible one that works. Then make sure the nonclustered indexes don't include spurious, unused large columns (I've seen that...). A major gain may be had by enabling page compression. And ultimately, inspect the index usage stats in sys.dm_db_index_usage_stats and see if any index is good for the axe.
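A sketch of that usage check (run it in the database in question; the object names come out of your own schema):

    -- Indexes that are written to a lot but rarely read are candidates to drop
    SELECT OBJECT_NAME(s.object_id) AS table_name,
           i.name                   AS index_name,
           s.user_seeks, s.user_scans, s.user_lookups,
           s.user_updates           AS writes
    FROM   sys.dm_db_index_usage_stats AS s
    JOIN   sys.indexes AS i
           ON i.object_id = s.object_id
          AND i.index_id  = s.index_id
    WHERE  s.database_id = DB_ID()
    ORDER BY s.user_updates DESC;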
If you can't reduce the IO load much, you should try to split it. Add filegroups to the database, move large indexes on separate filegroups, place the filegroups on separate IO paths (distinct spindles).
For future regular delete operations, the best alternative is to use partition switching: have all indexes aligned with the clustered index partitioning, and when the time is due just switch out and drop the oldest partition for a lightning-fast deletion.
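Roughly like this (the partition function, table and boundary value are placeholders), assuming the table is partitioned on the date column and a staging table with an identical structure exists on the same filegroup:

    -- Metadata-only move of the oldest partition into the staging table
    ALTER TABLE dbo.LargeArchive
        SWITCH PARTITION 1 TO dbo.LargeArchive_Staging;

    -- Emptying the staging table is near-instant
    TRUNCATE TABLE dbo.LargeArchive_Staging;

    -- Retire the now-empty boundary
    ALTER PARTITION FUNCTION pf_ArchiveDate()
        MERGE RANGE ('2010-01-01');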
Assume for each record in the table there are 5 index records.
Now each delete is in essence 5 operations.
Add to that, you have a clustered index. Notice that the clustered index delete time is huge - (10x?) longer than the other indexes? That is because your data is being reorganized with every record deleted.
I would suggest dropping at least that index, doing a mass delete, then reapplying it. Index operations on delete and insert are inherently costly. A single rebuild is likely to be a lot faster.
I second the suggestion that #NickLarsen made in a comment. Find out if you have unused indexes and drop them. This could reduce the overhead of those index-deletes, which might be enough of an improvement to make the operation more timely.
Another, more radical, strategy is to drop all the indexes, perform your deletes, and then quickly recreate the indexes for the now smaller data set. This doesn't necessarily interrupt service, but it could make queries a lot slower in the meantime. I am not a Microsoft SQL Server expert, though, so take my advice on this strategy with a grain of salt.
More of a workaround, but can you add an IsDeleted flag to the table and update that to 1 rather than deleting the rows? You will need to modify your SELECTs and UPDATEs to use this flag.
Then you can schedule deletion or archiving of these records for off-hours.
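A rough sketch of that approach (table, column and constraint names are placeholders):

    -- One-time schema change
    ALTER TABLE dbo.LargeLogTable
        ADD IsDeleted bit NOT NULL
            CONSTRAINT DF_LargeLogTable_IsDeleted DEFAULT (0);

    -- "Delete" during the day by flagging instead of removing
    DECLARE @ArchiveCutoff datetime = DATEADD(MONTH, -6, GETDATE());
    UPDATE dbo.LargeLogTable
    SET    IsDeleted = 1
    WHERE  EnteredDate < @ArchiveCutoff;

    -- Existing SELECTs and UPDATEs must now filter on IsDeleted = 0;
    -- the physical DELETE (or archive copy) of flagged rows runs off-hours.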
It would take some work to implement given this is in production, but if you are on SQL Server 2005/2008 you should investigate converting the table to a partitioned one; the removal of old data can then be achieved extremely quickly. Partitioning is designed for exactly this 'rolling window' effect and prevents large-scale deletes from tying up a table/process.
Unfortunately, with the table in production, migrating it to this technique will take some T-SQL coding, knowledge and a weekend to upgrade/migrate. Once in place, though, any existing selects and inserts will work against it seamlessly; the partition maintenance and addition/removal is where you need the T-SQL to control the process.
We have a SQL table that is populated with events from our website (mostly error logging and the like.) The table has several text fields that contain all of the information about the type of event, and a date/time field that shows when the event was logged. The table is fairly large and grows by around 10-100 records per day.
Obviously, when going through this log we are often looking for the most recent items, so I figured an obvious way to improve our search times would be to add an index on the date field. I figured that while either ASC or DESC would be great, DESC would be better since that's the direction we search most of the time. Our DB guy said "no way" - it would be really bad, because the index would rapidly become fragmented.
I could see why you wouldn't want a clustered index on date DESC, because you'd constantly be trying to insert at the beginning... but I thought that with a non-clustered index it would be okay, since the data rows wouldn't need to be moved around. But what he's saying also makes sense: the index entries would still have to be moved around.
But how much? And how big of a hit would it be? And even if it isn't much of a hit, maybe it's still not worth it because the performance on occasional selects just couldn't improve that much? Thoughts?
I don't think it's a bad idea - quite the contrary!
Not knowing your database system, I can't really be sure why your DB guy would think this is a bad idea. And even so, an ascending index on the date will already be quite beneficial (at least in the case of SQL Server).
In this case, if you frequently query by date and usually retrieve the most recent entries, this seems like a perfect index to me! Maybe you could make it even better by adding the second most likely selection criterion (log application? log type?), so that when you specify both the date and that second criterion, the search scope within the index is narrowed even further.
If I were you, I would try a few sample queries against the table without this index, then add the non-clustered index on your LogDate - first with ASC - and test how your queries perform (check out their execution plans!). Then try the index with DESC, and possibly also try the index on LogDate plus an additional criteria field. See how the performance looks.
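For instance (table and column names are illustrative), the DESC variant with a second criterion might look like:

    -- Descending index on the log date, with a likely second filter column
    CREATE NONCLUSTERED INDEX IX_EventLog_LogDate_Desc
        ON dbo.EventLog (LogDate DESC, EventType);

    -- ...and the ascending variant to compare against
    CREATE NONCLUSTERED INDEX IX_EventLog_LogDate_Asc
        ON dbo.EventLog (LogDate ASC, EventType);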
Marc
Indexes speed up some queries but slow down all loads. Whether or not an index gives an overall performance improvement depends on how much it speeds up your actual query workload and how much it slows down your actual loading workload (as well as deletes and updates that modify the indexed column).
In many (probably most) applications that involve storing event data, there is a huge amount of loading going on and relatively little querying, which is primarily summary-type queries that don't benefit from indexes. In these sorts of applications, indexes often do more harm than good.
In many such applications it is possible to do the loads during off hours, so even if an index gives an overall slowdown it might be worth it to increase query speed, because someone is waiting for the query output but no one waits for the load to complete. However, the index can grow so large that it overruns the file cache and each insert has to read and write a different leaf page from disk. At that point loads start to require a random-access disk read and write per inserted row, which can cause a load to take all day.