bulk insert with or without index - sql-server

In a comment I read
Just as a side note, it's sometimes faster to drop the indices of your table and recreate them after the bulk insert operation.
Is this true? Under which circumstances?

As with Joel I will echo the statement that yes it can be true. I've found that the key to identifying the scenario that he mentioned is all in the distribution of data, and the size of the index(es) that you have on the specific table.
In an application that I used to support that did a regular bulk import of 1.8 million rows, with 4 indexes on the table, 1 with 11 columns, and a total of 90 columns in the table. The import with indexes took over 20 hours to complete. Dropping the indexes, inserting, and re-creating the indexes only took 1 hour and 25 minutes.
So it can be a big help, but a lot of it comes down to your data, the indexes, and the distribution of data values.

Yes, it is true. When there are indexes on the table during an insert, the server will need to be constantly re-ordering/paging the table to keep the indexes up to date. If you drop the indexes, it can just add the rows without worrying about that, and then build the indexes all at once when you re-create them.
The exception, of course, is when the import data is already in index order. In fact, I should note that I'm working on a project right now where this opposite effect was observed. We wanted to reduce the run-time of a large import (nightly dump from a mainframe system). We tried removing the indexes, importing the data, and re-creating them. It actually significantly increased the time for the import to complete. But, this is not typical. It just goes to show that you should always test first for your particular system.

One thing you should consider when dropping and recreating indexes is that it should only be done on automated processes that run during the low volumne periods of database use. While the index is dropped it can't be used for other queries that other users might be riunning at the same time. If you do this during production hours ,your users will probably start complaining of timeouts.

Related

Terrible performance after SHRINK on static data

In my test environment, on a copy of my 4GB production database, I archived about 20% of my data, then ran a shrink on it from the SSMS, suggesting 20% max free space.
The result was a 2.7GB database with horrid performance. A particular query is about .5s in production, and about 11s now in test. If I remove the full-text portion of the query in test, execution time is about 2 seconds.
Actual execution plan is identical between production and test.
I rebuilt all the indexes and fulltext indexes. Performance is still about the same. No actual content in the test database has changed since duplication.
Any thoughts on where I'd look for the culprit (besides just behind the keyboard? :)
EDIT: ok, repeated the process three times, same results each time... HOWEVER, the performance degrades BEFORE I run the shrink - as soon as I archive inactive records. 0 seconds before the archive, 18 after. Get 7 seconds back after rebuilding some indexes. The archive process:
Creates a new "Archive" DB
Identifies 3 types of keys to delete, storing them in table variables
Performs a select into the "Archive" DB for those three keys from 20 tables
Deleted rows from 20 "Live" tables for those three keys.
That's it. Post-archive, when I look at the execution plan 40% time is spent in the very first operation, a clustered index scan.
I'm going to delete this and repost with the question rephrased, over at the SQL site.
relocated question: https://dba.stackexchange.com/questions/22337/option-force-order-improves-performance-until-rows-are-deleted
I'm going to delete this in a few days since the question is misleading, but just in case anyone is curious as to the outcome, it was solved here:
https://dba.stackexchange.com/questions/22337/option-force-order-improves-performance-until-rows-are-deleted
The shrink wasn't the cause, I only assumed it was because of the likelihood of fragmenting data with a shrink. The real issue was that deleting rows caused a bad statistical sample of the data shape to be taken. That in turn caused the query analyzer to return a bad plan. It thought its plan would scan about 900 rows, but instead it scanned over 52,000,000.
Thanks for all the help!

Oracle Materialized view VS Physical Table

Note: Oracle 11gR2 Standard version (so no partitioning)
So I have to build a process to build reports off a table containing about 27 million records. The dilemma I'm facing is the fact that I can't create my own indexes off this table as it's a 3rd party table that we can't alter. So, I started experimenting with the use of Materialized views where I can then create my own indexes, or a physical table that would basically just be a duplicate that I'd truncate and repopulate on demand.
The advantage with the MAT view is that it's basically pulling from the "Live" table, so I don't have to worry about discrepancies as long as I refresh it before use, the problem is the refresh seems to take a significant amount of time. I then decided to try the physical table approach, where I tried truncating and repopulating (Took around 10 min), then rebuild indexes (which takes another 10, give or take).... I also tried updating with only "new" record by performing a:
INSERT... SELECT where NOT Exists (Select 1 from Table where PK = PK)
Which almost takes 10 min also regardless of my index, parallelism, etc...
Has anyone had to deal with this amount of data (which will keep growing) and found an approach that performs well and works efficiently??
Seems a view won't do.... so I'm left with those 2 options because I can't tweak indexes on my primary table, so any tips suggestions would be greatly appreciated... The whole purpose of this process was to make things "faster" for reporting, but somehow where I'm gaining performance in some areas, I end up losing in others given the amount of data I need to move around. Are there other options aside from:
Truncate / Populate Table, Rebuild indexes
Populate secondary table from primary table where PK not exist
Materialized view (Refresh, Rebuild indexes)
View that pulls from Live table (No new indexes)
Thanks in advance for any suggestions.....
Does anyone know if doing a "Create Table As Select..." perform better than "Insert... Select" if I render my indexes and such unusable when doing my insert on the second option, or should it be fairly similar?
I think that there's a lot to be said for a very simple approach on this sort of task. Consider a truncate and direct path (append) insert on the duplicate table without disabling/rebuilding indexes, with NOLOGGING set on the table. The direct path insert has a index maintenance mechanism associated with it that is possibly more efficient than running multiple index rebuilds post-load, as it logs in temporary segments the data required to build the indexes and thus avoids subsequent multiple full table scans.
If you do want to experiment with index disable/rebuild then try rebuilding all the indexes at the same time without query parallelism, as only one physical full scan will be used -- the rest of the scans will be "parasitic" in that they'll read the table blocks from memory.
When you load the duplicate table consider ordering the rows in the select so that commonly used predicates on the reports are able to access fewer blocks. For example if you commonly query on date ranges, order by the date column. Remember that a little extra time spent in building this report table can be recovered in reduced report query execution time.
Consider compressing the table also, but only if you're loading with direct path insert unless you have the pricey Advanced Compression option. Index compression and bitmap indexes are also worth considering.
Also, consider not analyzing the reporting table. Report queries commonly use multiple predicates that are not well estimated using conventional statistics, and you have to rely on dynamic sampling for good cardinality estimates anyway.
"Create Table As Select" generate lesser undo. That's an advantage.
When data is "inserted" indexes also are maintained and performance is impacted negatively.

Sql Server 2008 R2 DC Inserts Performance Change

I have noticed an interesting performance change that happens around 1,5 million entered values. Can someone give me a good explanation why this is happening?
Table is very simple. It is consisted of (bigint, bigint, bigint, bool, varbinary(max))
I have a pk clusered index on first three bigints. I insert only boolean "true" as data varbinary(max).
From that point on, performance seems pretty constant.
Legend: Y (Time in ms) | X (Inserts 10K)
I am also curios about constant relatively small (sometimes very large) spikes I have on the graph.
Actual Execution Plan from before spikes.
Legend:
Table I am inserting into: TSMDataTable
1. BigInt DataNodeID - fk
2. BigInt TS - main timestapm
3. BigInt CTS - modification timestamp
4. Bit: ICT - keeps record of last inserted value (increases read performance)
5. Data: Data
Bool value Current time stampl keeps
Enviorment
It is local.
It is not sharing any resources.
It is fixed size database (enough so it does not expand).
(Computer, 4 core, 8GB, 7200rps, Win 7).
(Sql Server 2008 R2 DC, Processor Affinity (core 1,2), 3GB, )
Have you checked the execution plan once the time goes up? The plan may change depending on statistics. Since your data grow fast, stats will change and that may trigger a different execution plan.
Nested loops are good for small amounts of data, but as you can see, the time grows with volume. The SQL query optimizer then probably switches to a hash or merge plan which is consistent for large volumes of data.
To confirm this theory quickly, try to disable statistics auto update and run your test again. You should not see the "bump" then.
EDIT: Since Falcon confirmed that performance changed due to statistics we can work out the next steps.
I guess you do a one by one insert, correct? In that case (if you cannot insert bulk) you'll be much better off inserting into a heap work table, then in regular intervals, move the rows in bulk into the target table. This is because for each inserted row, SQL has to check for key duplicates, foreign keys and other checks and sort and split pages all the time. If you can afford postponing these checks for a little later, you'll get a superb insert performance I think.
I used this method for metrics logging. Logging would go into a plain heap table with no indexes, no foreign keys, no checks. Every ten minutes, I create a new table of this kind, then with two "sp_rename"s within a transaction (swift swap) I make the full table available for processing and the new table takes the logging. Then you have the comfort of doing all the checking, sorting, splitting only once, in bulk.
Apart from this, I'm not sure how to improve your situation. You certainly need to update statistics regularly as that is a key to a good performance in general.
Might try using a single column identity clustered key and an additional unique index on those three columns, but I'm doubtful it would help much.
Might try padding the indexes - if your inserted data are not sequential. This would eliminate excessive page splitting and shuffling and fragmentation. You'll need to maintain the padding regularly which may require an off-time.
Might try to give it a HW upgrade. You'll need to figure out which component is the bottleneck. It may be the CPU or the disk - my favourite in this case. Memory not likely imho if you have one by one inserts. It should be easy then, if it's not the CPU (the line hanging on top of the graph) then it's most likely your IO holding you back. Try some better controller, better cached and faster disk...

sql server delete slowed drastically by indexes

I am running an archive script which deletes rows from a large (~50m record DB) based on the date they were entered. The date field is the clustered index on the table, and thus what I'm applying my conditional statement to.
I am running this delete in a while loop, trying anything from 1000 to 100,000 records in a batch. Regardless of batch size, it is surprisingly slow; something like 10,000 records getting deleted a minute. Looking at the execution plan, there is a lot of time spent on "Index Delete"s. There are about 15 fields in the table, and roughly 10 of them have some sort of index on them. Is there any way to get around this issue? I'm not even sure why it takes so long to do each index delete, can someone shed some light on exactly whats happening here? This is a sample of my execution plan:
alt text http://img94.imageshack.us/img94/1006/indexdelete.png
(The Sequence points to the Delete command)
This database is live and is getting inserted into often, which is why I'm hesitant to use the copy and truncate method of trimming the size. Is there any other options I'm missing here?
Deleting 10k records from a clustered index + 5 non clustered ones should definetely not take 1 minute. Sounds like you have a really really slow IO subsytem. What are the values for:
Avg. Disk sec/Write
Avg. Disk sec/Read
Avg. Disk Write Queue Length
Avg. Disk Read Queue Length
On each drive involved in the operation (including the Log ones!). If you placed indexes in separate filegroups and allocated each filegroup to its own LUN or own disk, then you can identify which indexes are more problematic. Also, the log flush may be a major bottleneck. SQL Server doesn't have much control here, is all in your own hands how to speed things up. that time is not spent in CPU cycles, is spent waiting for IO to complete and you need an IO subsystem calibrated for the load you demand.
To reduce the IO load you should look into making indexes narrower. Primarily, make sure the clustered index is the narrowest possible that works. Then, make sure the nonclustered indexes don't include sporious unused large columns (I've seen that...). A major gain may be had by enabling page compression. And ultimately, inspect index usage stats in sys.dm_db_index_usage_stats and see if any index is good for the axe.
If you can't reduce the IO load much, you should try to split it. Add filegroups to the database, move large indexes on separate filegroups, place the filegroups on separate IO paths (distinct spindles).
For future regular delete operations, the best alternative is to use partition switching, have all indexes aligned with the clustered index partitioning and when the time is due, just drop the last partition for a lightning fast deletion.
Assume for each record in the table there are 5 index records.
Now each delete is in essence 5 operations.
Add to that, you have a clustered index. Notice the clustered index delete time is huge? (10x) longer than the other indexes? This is because your data is being reorganized with every record deleted.
I would suggest dropping at least that index, doing a mass delete, than reapplying. Index operations on delete and insert are inherently costly. A single rebuild is likely a lot faster.
I second the suggestion that #NickLarsen made in a comment. Find out if you have unused indexes and drop them. This could reduce the overhead of those index-deletes, which might be enough of an improvement to make the operation more timely.
Another more radical strategy is to drop all the indexes, perform your deletes, and then quickly recreate the indexes for the now smaller data set. This doesn't necessarily interrupt service, but it could probably make queries a lot slower in the meantime. Though I am not a Microsoft SQL Server expert, so you should take my advice on this strategy with a grain of salt.
More of a workaround, but can you add an IsDeleted flag to the table and update that to 1 rather than deleting the rows? You will need to modify your SELECTs and UPDATEs to use this flag.
Then you can schedule deletion or archiving of these records for off-hours.
It would take some work to implement it given this is in production, but if you are on SQL Server 2005 / 2008 you should investigate and convert the table to being partitioned, then the removal of old data can be achieved extremely quickly. It is designed for a 'rolling window' type effect and prevents large scale deletes tieing up a table / process.
Unfortunately with the table in production, migrating it across to this technique will take some T-SQL coding, knowledge and a weekend to upgrade / migrate it. Once in place though any existing selects and inserts will work against it seamlessly, the partition maintenance and addition / removal is where you need the t-sql to control the process.

Very large tables in SQL Server

We have a very large table (> 77M records and growing) runing on SQL Server 2005 64bit Standard edition and we are seeing some performance issues. There are up to a hundred thousand records added daily.
Does anyone know if there is a limit to the number of records SQL server Standard edition can handle? Should be be considering moving to Enterprise edition or are there some tricks we can use?
Additional info:
The table in question is pretty flat (14 columns), there is a clustered index with 6 fields, and two other indexes on single fields.
We added a fourth index using 3 fields that were in a select in one problem query and did not see any difference in the estimated performance (the query is part of a process that has to run in the off hours so we don't have metrics yet). These fields are part of the clustered index.
Agreeing with Marc and Unkown above ... 6 indexes in the clustered index is way too many, especially on a table that has only 14 columns. You shouldn't have more than 3 or 4, if that, I would say 1 or maybe 2. You may know that the clustered index is the actual table on the disk so when a record is inserted, the database engine must sort it and place it in it's sorted organized place on the disk. Non clustered indexes are not, they are supporting lookup 'tables'. My VLDBs are laid out on the disk (CLUSTERED INDEX) according to the 1st point below.
Reduce your clustered index to 1 or 2. The best field choices are the IDENTITY (INT), if you have one, or a date field in which the fields are being added to the database, or some other field that is a natural sort of how your data is being added to the database. The point is you are trying to keep that data at the bottom of the table ... or have it laid out on the disk in the best (90%+) way that you'll read the records out. This makes it so that there is no reorganzing going on or that it's taking one and only one hit to get the data in the right place for the best read. Be sure to put the removed fields into non-clustered indexes so you don't lose the lookup efficacy. I have NEVER put more than 4 fields on my VLDBs. If you have fields that are being update frequently and they are included in your clustered index, OUCH, that's going to reorganize the record on the disk and cause COSTLY fragmentation.
Check the fillfactor on your indexes. The larger the fill factor number (100) the more full the data pages and index pages will be. In relation to how many records you have and how many records your are inserting you will change the fillfactor # (+ or -) of your non-clustered indexes to allow for the fill space when a record is inserted. If you change your clustered index to a sequential data field, then this won't matter as much on a clustered index. Rule of thumb (IMO), 60-70 fillfactor for high writes, 70-90 for medium writes, and 90-100 for high reads/low writes. By dropping your fillfactor to 70, will mean that for every 100 records on a page, 70 records are written, which will leave free space of 30 records for new or reorganized records. Eats up more space, but it sure beats having to DEFRAG every night (see 4 below)
Make sure the statistics exist on the table. If you want to sweep the database to create statistics using the "sp_createstats 'indexonly'", then SQL Server will create all the statistics on all the indexes that the engine has accumulated as requiring statistics. Don't leave off the 'indexonly' attribute though or you'll add statistics for every field, that would then not be good.
Check the table/indexes using DBCC SHOWCONTIG to see which indexes are getting fragmented the most. I won't go into the details here, just know that you need to do it. Then based on that information, change the fillfactor up or down in relation to the changes the indexes are experiencing change and how fast (over time).
Setup a job schedule that will do online (DBCC INDEXDEFRAG) or offline (DBCC DBREINDEX) on individual indexes to defrag them. Warning: don't do DBCC DBREINDEX on this large of a table without it being during maintenance time cause it will bring the apps down ... especially on the CLUSTERED INDEX. You've been warned. Test and test this part.
Use the execution plans to see what SCANS, and FAT PIPES exist and adjust the indexes, then defrag and rewrite stored procs to get rid of those hot spots. If you see a RED object in your execution plan, it's because there are not statistics on that field. That's bad. This step is more of the "art than the science".
On off peak times, run the UPDATE STATISTICS WITH FULLSCAN to give the query engine as much information about the data distributions as you can. Otherwise do the standard UPDATE STATISTICS (with standard 10% scan) on tables during the weeknights or more often as you see fit with your observerations to make sure the engine has more information about the data distributions to retrieve the data for efficiently.
Sorry this is so long, but it's extremely important. I've only give you here minimal information but will help a ton. There's some gut feelings and observations that go in to strategies used by these points that will require your time and testing.
No need to go to Enterprise edition. I did though in order to get the features spoken of earlier with partitioning. But I did ESPECIALLY to have much better mult-threading capabilities with searching and online DEFRAGING and maintenance ... In Enterprise edition, it is much much better and more friendly with VLDBs. Standard edition doesn't handle doing DBCC INDEXDEFRAG with online databases as well.
The first thing I'd look at is indexing. If you use the execution plan generator in Management Studio, you want to see index seeks or clustered index seeks. If you see scans, particularly table scans, you should look at indexing the columns you generally search on to see if that improves your performance.
You should certainly not need to move to Enterprise edition for this.
[there is a clustered index with 6 fields, and two other indexes on single fields.]
Without knowing any details about the fields, I would try to find a way to make the clustered index smaller.
With SQL Server, all the clustered-key fields will also be included in all the non-clustered indices (as a way to do the final lookup from non-clustered index to actual data page).
If you have six fields at 8 bytes each = 48 bytes, multiply that by two more indices times 77 million rows - and you're looking at a lot of wasted space which translates into a lot
of I/O operations (and thus degrades performance).
For the clustered index, it's absolutely CRUCIAL for it to be unique, stable, and as small as possible (preferably a single INT or such).
Marc
Do you really need to have access to all 77 million records in a single table?
For example, if you only need access to the last X months worth of data, then you could consider creating an archiving strategy. This could be used to relocate data to an archive table in order to reduce the volume of data and subsequently, query time on your 'hot' table.
This approach could be implemented in the standard edition.
If you do upgrade to the Enterprise edition you can make use of table partitioning. Again depending on your data structure this can offer significant performance improvements. Partitioning can also be used to implement the strategy previously mentioned but with less administrative overhead.
Here is an excellent White paper on table partitioning in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope what I have detailed is clear and understandable. Please do feel to contact me directly if you require further assistance.
Cheers,
http://msdn.microsoft.com/en-us/library/ms143432.aspx
You've got some room to grow.
As far as performance issues, that's a whole other question. Caching, sharding, normalizing, indexing, query tuning, app code tuning, and so on.
Standard should be able to handle it. I would look at indexing and the queries you use with the table. You want to structure things in such a way that your inserts don't cause too many index recalcs, but your queries can still take advantage of the index to limit lookups to a small portion of the table.
Beyond that, you might consider partitioning the table. This will allow you to divide the table into several logical groups. You can do it "behind-the-scenes", so it still appears in sql server as one table even though it stored separately, or you can do it manually (create a new 'archive' or yearly table and manually move over rows). Either way, only do it after you looked at the other options first, because if you don't get that right you'll still end up having to check every partition. Also: partitioning does require Enterprise Edition, so that's another reason to save this for a last resort.
In and of itself, 77M records is not a lot for SQL Server. How are you loading the 100,000 records? is that a batch load each day? or thru some sort of OLTP application? and is that the performance issue you are having, i.e adding the data? or is it the querying that giving you the most problems?
If you are adding 100K records at a time, and the records being added are forcing the cluster-index to re-org your table, that will kill your performance quickly. More details on the table structure, indexes and type of data inserted will help.
Also, the amount of ram and the speed of your disks will make a big difference, what are you running on?
maybe these are minor nits, but....
(1) relational databases don't have FIELDS... they have COLUMNS.
(2) IDENTITY columns usually mean the data isn't normalized (or the designer was lazy). Some combination of columns MUST be unique (and those columns make up the primary key)
(3) indexing on datetime columns is usually a bad idea; CLUSTERING on datetime columns is also usually a bad idea, especially an ever-increasing datetime column, as all the inserts are contending for the same physical space on disk. Clustering on datetime columns in a read-only table where that column is part of range restrictions is often a good idea (see how the ideas conflict? who said db design wasn't an art?!)
What type of disks do you have?
You might monitor some disk counters to see if requests are queuing.
You might move this table to another drive by putting it in another filegroup. You can also to the same with the indexes.
Initially I wanted to agree with Marc. The width of your clustered index seems suspect, as it will essentially be used as the key to perform lookups on all your records. The wider the clustered index, the slower the access, generally. And a six field clustered index feels really, really suspect.
Uniqueness is not required for a clustered index. In fact, the best candidates for fields that should be in the clustered index are ones that are not unique and used in joins. For example, in a Persons table where each Person belongs to one Group and you frequently join Persons to Groups, while accessing batches of people by group, Person.group_id would be an ideal candidate, for this particular use case.

Resources