We have a SQL Server 2000 database in place with about 6,000,000 records, and one column has a PDF stored in it.
The situation is that every month, on a specific day, we delete about 250,000 records and insert about 250,000. After that there are no updates, only reads.
The question is: is it better to alternate, deleting 500 records, then inserting 500, then deleting, then inserting, and so on?
Or to delete the 250,000 at once and insert the 250,000 in 500-row batches?
Which option is optimal and has the best memory management?
+1 for anyone who points me to an MSDN article or something.
http://msdn.microsoft.com/en-us/library/aa178096%28v=sql.80%29.aspx
Since you don't mention it, it's worth underlining the standard practice for inserting/deleting/updating any very large volume of data on an RDBMS: drop all indexes before applying the changes, and reapply them afterwards.
Two reasons for this.
It's faster for the system to rebuild the indexes in one go rather than record by record (less head movement on the disk).
If you rebuild an index from scratch, subsequent accesses through it are likely to be faster, as the index tree will tend to be better balanced.
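On SQL Server 2000 the pattern might look like this; it's a minimal sketch, and the table and index names are hypothetical:

```sql
-- Drop the index before the monthly bulk change
-- (SQL Server 2000 DROP INDEX syntax; all names are hypothetical).
DROP INDEX Documents.IX_Documents_LoadDate

-- ...run the monthly 250,000-row delete and 250,000-row insert here...

-- Rebuild the index in one pass over the final data.
CREATE INDEX IX_Documents_LoadDate ON dbo.Documents (LoadDate)
```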
You might want to consider partitioning. If you organize it so you can simply drop a partition at a time, deleting will take milliseconds. See http://msdn.microsoft.com/en-us/library/aa902650(v=sql.80).aspx
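On SQL Server 2000 specifically, partitioning means a partitioned view over per-period tables, as in the linked article. A hedged sketch, with all table, view, and column names hypothetical:

```sql
-- One table per month, each CHECK-constrained to its date range.
CREATE TABLE dbo.Docs_2010_01 (
    DocId    INT      NOT NULL PRIMARY KEY,
    LoadDate DATETIME NOT NULL
        CHECK (LoadDate >= '20100101' AND LoadDate < '20100201'),
    Pdf      IMAGE    NULL
)
GO
-- (dbo.Docs_2010_02 and later months are created the same way.)
CREATE VIEW dbo.Docs AS
    SELECT DocId, LoadDate, Pdf FROM dbo.Docs_2010_01
    UNION ALL
    SELECT DocId, LoadDate, Pdf FROM dbo.Docs_2010_02
GO
-- "Deleting" a month is then just a metadata operation:
DROP TABLE dbo.Docs_2010_01   -- and recreate the view without it
```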
Related
We write daily about 1 million records into a sql server table.
Records have an insertdate and a status field, among others of course.
I need to delete records from time to time to free space on the volume, but leave the last 4 days' records in place.
The problem is the deletion takes hours and lots of resources.
I have thought about partitioned tables, setting the partition field to the insertdate, but I have never used that kind of table.
How can I achieve the goal using the least CPU/disk resources, with a solution that has the fewest drawbacks possible? (I assume any solution has its own drawbacks, but please explain them if you know them.)
There are two approaches you can take to speed up the deletion. One is to delete 10,000 rows at a time so the transaction log does not grow to an enormous size: based on some logic, you keep deleting the top 10,000 rows until all the rows that fulfill the condition are deleted. Depending on your system, this can speed up the deletes by a factor of 100.
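A minimal sketch of that loop, assuming a table and columns like the ones in the question (names are hypothetical):

```sql
-- Delete expired rows in 10,000-row chunks so each chunk is a small,
-- quickly committed transaction (SQL Server 2005+ DELETE TOP syntax).
DECLARE @rows INT
SET @rows = 1
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.DailyLoad
    WHERE insertdate < DATEADD(DAY, -4, GETDATE())
    SET @rows = @@ROWCOUNT   -- 0 once nothing is left to delete
END
```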
The other approach is to create a partition on the table. You have to create a partition scheme and function, and if all the rows you are deleting are in one partition, let's say a day's worth of sales, then deleting the partition removes all the rows in a metadata operation that takes only a few seconds. Partitioning is not hard, but you have to spend some time setting up a rolling window properly; it's more than an hour of work, but less than a week.
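A hedged sketch of such a rolling window (SQL Server 2005+; every name and date below is hypothetical, and the staging table must match the partitioned table's structure and sit on the same filegroup):

```sql
-- Day-based partition function and scheme.
CREATE PARTITION FUNCTION pfByDay (DATETIME)
AS RANGE RIGHT FOR VALUES ('20100101', '20100102', '20100103')

CREATE PARTITION SCHEME psByDay
AS PARTITION pfByDay ALL TO ([PRIMARY])

CREATE TABLE dbo.DailyLoad (
    insertdate DATETIME NOT NULL,
    status     INT      NOT NULL
) ON psByDay (insertdate)

CREATE TABLE dbo.DailyLoad_Stage (
    insertdate DATETIME NOT NULL,
    status     INT      NOT NULL
)   -- identical structure, same filegroup as the switched partition

-- Expire the oldest day (partition 2 holds 2010-01-01 here):
-- switch it out, empty it, then remove the boundary.
ALTER TABLE dbo.DailyLoad SWITCH PARTITION 2 TO dbo.DailyLoad_Stage
TRUNCATE TABLE dbo.DailyLoad_Stage
ALTER PARTITION FUNCTION pfByDay() MERGE RANGE ('20100101')
```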
I am developing a job application which executes multiple parallel jobs. Every job pulls data from a third-party source and processes it; the minimum is 100,000 records per job. So I am creating a new table for each job (like Job123, where 123 is the job ID) and processing it there. When a job starts, it clears the old records, gets new records, and processes them. Now the problem is that I have 1,000 jobs, so the DB has 1,000 tables, and the DB size has increased drastically because of all those tables.
My question is whether it is OK to create a new table for each job, or to have only one table called Job with a jobId column, and enter and process the data there. The only concern is that every job will have 100,000+ records; if we have only one table, will DB performance be affected?
Please let me know which approach is better.
Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine, that's what databases are for. But...I suspect that you don't need 100 million persistent records, do you? It looks like you only process one Job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing; even after the deletes, though, it will still take up as much space as 100k records.
I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
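A minimal sketch of that single-table design, with hypothetical column names:

```sql
-- One table for all jobs, clustered on (JobId, RecordId) so each job's
-- rows are physically contiguous.
CREATE TABLE dbo.JobRecords (
    JobId    INT          NOT NULL,
    RecordId BIGINT       NOT NULL,
    Payload  VARCHAR(MAX) NULL,
    CONSTRAINT PK_JobRecords PRIMARY KEY CLUSTERED (JobId, RecordId)
)

-- Clearing one job's old records touches only that job's range:
DELETE FROM dbo.JobRecords WHERE JobId = 123
```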
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
I am running an archive script which deletes rows from a large (~50M-record) table based on the date they were entered. The date field is the clustered index on the table, and thus what I'm applying my conditional statement to.
I am running this delete in a while loop, trying anything from 1,000 to 100,000 records in a batch. Regardless of batch size, it is surprisingly slow; something like 10,000 records get deleted per minute. Looking at the execution plan, a lot of time is spent on "Index Delete"s. There are about 15 fields in the table, and roughly 10 of them have some sort of index on them. Is there any way to get around this issue? I'm not even sure why each index delete takes so long; can someone shed some light on exactly what's happening here? This is a sample of my execution plan:
[Execution plan image: http://img94.imageshack.us/img94/1006/indexdelete.png]
(The Sequence points to the Delete command)
This database is live and gets inserted into often, which is why I'm hesitant to use the copy-and-truncate method of trimming the size. Are there any other options I'm missing here?
Deleting 10k records from a clustered index plus 5 nonclustered ones should definitely not take 1 minute. It sounds like you have a really, really slow IO subsystem. What are the values for:
Avg. Disk sec/Write
Avg. Disk sec/Read
Avg. Disk Write Queue Length
Avg. Disk Read Queue Length
On each drive involved in the operation (including the log ones!). If you placed the indexes in separate filegroups and allocated each filegroup to its own LUN or its own disk, then you can identify which indexes are the most problematic. Also, the log flush may be a major bottleneck. SQL Server doesn't have much control here; how to speed things up is entirely in your hands. That time is not spent in CPU cycles; it is spent waiting for IO to complete, and you need an IO subsystem calibrated for the load you demand.
To reduce the IO load you should look into making the indexes narrower. Primarily, make sure the clustered index is the narrowest possible that works. Then make sure the nonclustered indexes don't include spurious, unused large columns (I've seen that...). A major gain may be had by enabling page compression. And finally, inspect the index usage stats in sys.dm_db_index_usage_stats and see if any index is a good candidate for the axe.
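The usage-stats check might look something like this; these are standard DMVs, and the only assumption is that it runs in the target database:

```sql
-- Indexes that are heavily written but rarely read are candidates to drop.
SELECT OBJECT_NAME(s.object_id)                     AS table_name,
       i.name                                       AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates                               AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY reads ASC, writes DESC
```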
If you can't reduce the IO load much, you should try to split it. Add filegroups to the database, move large indexes on separate filegroups, place the filegroups on separate IO paths (distinct spindles).
For future regular delete operations, the best alternative is to use partition switching, have all indexes aligned with the clustered index partitioning and when the time is due, just drop the last partition for a lightning fast deletion.
Assume that for each record in the table there are 5 index records.
Each delete is then, in essence, 5 operations.
Add to that the clustered index. Notice that the clustered index delete time is huge, (10x) longer than the other indexes? That is because your data is being reorganized with every record deleted.
I would suggest dropping at least that index, doing the mass delete, then reapplying it. Index operations on delete and insert are inherently costly; a single rebuild is likely to be a lot faster.
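A related option on SQL Server 2005 and later, if dropping indexes outright feels too invasive, is to disable the nonclustered indexes and rebuild them afterwards (disabling preserves the index definition; note that a disabled clustered index takes the whole table offline, so leave that one alone). Index and table names here are hypothetical:

```sql
-- Disable a nonclustered index for the duration of the mass delete.
ALTER INDEX IX_Archive_Status ON dbo.Archive DISABLE

DELETE FROM dbo.Archive WHERE EntryDate < '20100101'

-- REBUILD re-enables the index and rebuilds it in one pass.
ALTER INDEX IX_Archive_Status ON dbo.Archive REBUILD
```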
I second the suggestion that @NickLarsen made in a comment. Find out if you have unused indexes and drop them. This could reduce the overhead of those index deletes, which might be enough of an improvement to make the operation more timely.
Another, more radical strategy is to drop all the indexes, perform your deletes, and then quickly recreate the indexes for the now-smaller data set. This doesn't necessarily interrupt service, but it could make queries a lot slower in the meantime. I am not a Microsoft SQL Server expert, though, so take my advice on this strategy with a grain of salt.
More of a workaround, but can you add an IsDeleted flag to the table and update it to 1 rather than deleting the rows? You would need to modify your SELECTs and UPDATEs to use this flag.
Then you can schedule deletion or archiving of these records for off-hours.
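A minimal sketch of that flag approach, with hypothetical table and column names:

```sql
-- Add the soft-delete flag (defaults to "not deleted").
ALTER TABLE dbo.Archive ADD IsDeleted BIT NOT NULL DEFAULT 0

-- During business hours, flag rows instead of deleting them.
UPDATE dbo.Archive SET IsDeleted = 1 WHERE EntryDate < '20100101'

-- An off-hours job then removes flagged rows in small batches.
DELETE TOP (10000) FROM dbo.Archive WHERE IsDeleted = 1
```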
It would take some work to implement given that this is in production, but if you are on SQL Server 2005/2008 you should investigate converting the table to a partitioned one; the removal of old data can then be achieved extremely quickly. Partitioning is designed for exactly this 'rolling window' effect and prevents large-scale deletes from tying up a table/process.
Unfortunately, with the table in production, migrating it across to this technique will take some T-SQL coding, knowledge, and a weekend to upgrade/migrate it. Once in place, though, any existing selects and inserts will work against it seamlessly; the partition maintenance and addition/removal is where you need the T-SQL to control the process.
In a comment I read
Just as a side note, it's sometimes faster to drop the indices of your table and recreate them after the bulk insert operation.
Is this true? Under which circumstances?
As with Joel, I will echo the statement that yes, it can be true. I've found that the key to identifying the scenario he mentions lies in the distribution of the data and the size of the index(es) you have on the specific table.
In an application that I used to support, a regular bulk import loaded 1.8 million rows into a table with 90 columns and 4 indexes, one of them spanning 11 columns. The import with the indexes in place took over 20 hours to complete. Dropping the indexes, inserting, and re-creating the indexes took only 1 hour and 25 minutes.
So it can be a big help, but a lot of it comes down to your data, the indexes, and the distribution of data values.
Yes, it is true. When there are indexes on the table during an insert, the server constantly has to re-order/re-page the table to keep the indexes up to date. If you drop the indexes, it can just add the rows without worrying about that, and then build the indexes all at once when you re-create them.
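As a sketch, the pattern is simply this (the file path, table, and index names are hypothetical, and DROP INDEX ... ON is SQL Server 2005+ syntax):

```sql
-- Drop the nonclustered index, load without index maintenance overhead,
-- then rebuild the index in a single pass.
DROP INDEX IX_Import_Code ON dbo.Import

BULK INSERT dbo.Import
FROM 'C:\feeds\nightly.dat'
WITH (TABLOCK, BATCHSIZE = 100000)

CREATE INDEX IX_Import_Code ON dbo.Import (Code)
```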
The exception, of course, is when the import data is already in index order. In fact, I should note that I'm working on a project right now where the opposite effect was observed. We wanted to reduce the run time of a large import (a nightly dump from a mainframe system). We tried removing the indexes, importing the data, and re-creating them; it actually significantly increased the time for the import to complete. But this is not typical. It just goes to show that you should always test first on your particular system.
One thing you should consider when dropping and recreating indexes is that it should only be done by automated processes that run during low-volume periods of database use. While an index is dropped, it can't be used by queries that other users might be running at the same time. If you do this during production hours, your users will probably start complaining of timeouts.
I have a table with more than a million rows. This table is used to index TIFF images. Each image has fields like date, number, etc. I have users who index these images in batches of 500. I need to know if it is better to first insert 500 rows and then perform 500 updates or, when the user finishes indexing, to do the 500 inserts with all the data. A very important point is that if I do the 500 inserts up front, that time is free for me because I can do it the night before.
So the question is: is it better to do only inserts, or inserts and updates, and why? I have defined an ID value for each image, and I also have other indexes on the fields.
Updates in SQL Server result in ghosted rows: SQL Server crosses one row out and puts a new one in, and the crossed-out row is deleted later.
Both inserts and updates can cause page-splits in this way, they both effectively 'add' data, it's just that updates flag the old stuff out first.
On top of this, updates need to look the row up first, which for lots of data can take longer than the update itself.
Inserts will just about always be quicker, especially if they are either in order or if the underlying table doesn't have a clustered index.
When inserting larger amounts of data into a table, look at the current indexes: they can take a while to change and rebuild. Adding values in the middle of an index is always slower.
You can think of it like appending to an address book: Mr Z can just be added to the last page, while you'll have to find space in the middle for Mr M.
Doing the inserts first and then the updates does seem to be the better idea, for several reasons. You will be inserting at a time of low transaction volume; since the inserts carry most of the data, that is the better time to do them.
Since you are using an ID value (which is presumably indexed) for the updates, the overhead of the updates will be very low, and you will also be writing less data during them.
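A sketch of that insert-first pattern; the table and column names are hypothetical:

```sql
-- Night before: insert placeholder rows carrying only the key
-- (the free, low-volume window).
INSERT INTO dbo.TiffIndex (ImageId)
SELECT ImageId FROM dbo.IncomingBatch

-- During indexing: each update is a cheap seek on the indexed key.
UPDATE dbo.TiffIndex
SET ScanDate = '20100115', DocNumber = 42
WHERE ImageId = 1001
```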
You could also turn off transactions at the batch (500 inserts/updates) level and use them for each individual record instead, reducing some overhead.
Finally, test this out to see the actual performance on your server before making a final decision.
This isn't a cut-and-dried question. Krishna's and Galegian's points are spot on.
For updates, the impact will be lessened if the updates are affecting fixed-length fields. If updating varchar or blob fields, you may add a cost of page splits during update when the new value surpasses the length of the old value.
I think the inserts will run faster. They do not require a lookup (when you do an update, you are basically doing the equivalent of a select with a where clause first). Also, an insert won't lock rows the way an update will, so it won't interfere with any selects happening against the table at the same time.
The execution plan for each query will tell you which one should be more expensive. The real limiting factor will be the writes to disk, so you may need to run some tests with perfmon running to see which query causes more writes and makes the disk queue grow longest (longer is bad).
I'm not a database guy, but I imagine doing the inserts in one shot would be faster because the updates require a lookup whereas the inserts do not.