When does MS-SQL maintain table indexes? - sql-server

For argument's sake, let's say it's SQL Server 2005/2008. I understand that when you place indexes on a table to tune SELECT statements, those indexes need maintaining during INSERT / UPDATE / DELETE actions.
My main question is this:
When will SQL Server maintain a table's indexes?
I have many subsequent questions:
I naively assume that it will do so after a command has executed. Say you are inserting 20 rows, it will maintain the index after 20 rows have been inserted and committed.
What happens in the situation where a script features multiple, otherwise distinct statements against a table?
Does the server have the intelligence to maintain the index after all statements are executed, or does it do it per statement?
I've seen situations where indexes are dropped and recreated after large / many INSERT / UPDATE actions.
This presumably incurs rebuilding the entire table's indexes even if you only change a handful of rows?
Would there be a performance benefit in attempting to collate INSERT and UPDATE actions into a larger batch, say by collecting rows to insert in a temporary table, as opposed to doing many smaller inserts?
How would collating the rows above stack up against dropping an index versus taking the maintenance hit?
Sorry for the proliferation of questions - it's something I've always known to be mindful of, but when trying to tune a script to get a balance, I find I don't actually know when index maintenance occurs.
Edit: I understand that performance questions largely depend on the amount of data in the insert/update and the number of indexes. Again, for argument's sake, I'd have two situations:
An index-heavy table tuned for selects.
An index-light table (PK).
Both situations would have a large insert/update batch, say, 10k+ rows.
Edit 2: I'm aware of being able to profile a given script on a data set. However, profiling doesn't tell me why a given approach is faster than another. I am more interested in the theory behind the indexes and where performance issues stem, not a definitive "this is faster than that" answer.
Thanks.

When your statement (not even your transaction) is completed, all your indexes are up to date. When you commit, all the changes become permanent and all locks are released. Doing otherwise would not be "intelligence"; it would violate integrity and possibly cause errors.
Edit: by "integrity" I mean this: once committed, the data should be immediately available to anyone. If the indexes are not up-to-date at that moment, someone may get incorrect results.
As you are increasing batch size, your performance originally improves, then it will slow down. You need to run your own benchmarks and find out your optimal batch size. Similarly, you need to benchmark to determine whether it is faster to drop/recreate indexes or not.
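As a rough illustration (not a definitive benchmark), a timing harness could look like this; all table names here are made up:
CREATE TABLE #staging (col1 INT NOT NULL, col2 INT NOT NULL);
CREATE TABLE #target  (col1 INT NOT NULL, col2 INT NOT NULL);

INSERT INTO #staging (col1, col2) VALUES (1, 1);   -- ...load your batch here...

DECLARE @t0 DATETIME;
SET @t0 = GETDATE();

-- one set-based statement: any indexes on #target are maintained once, for the whole batch
INSERT INTO #target (col1, col2)
SELECT col1, col2 FROM #staging;

SELECT DATEDIFF(ms, @t0, GETDATE()) AS elapsed_ms;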
Edit: if you insert/update/delete batches of rows in one statement, your indexes are modified once per statement. The following script demonstrates that:
CREATE TABLE dbo.Num(n INT NOT NULL PRIMARY KEY);
GO
INSERT INTO dbo.Num(n)
SELECT 0
UNION ALL
SELECT 1;
GO
-- 0 updates to 1, 1 updates to 0
UPDATE dbo.Num SET n = 1-n;
GO
-- doing it row by row would fail no matter which order you pick:
-- whichever row you update first collides with the value the other row still holds,
-- so the PRIMARY KEY index would be violated mid-way
UPDATE dbo.Num SET n = 1-n WHERE n=0;
UPDATE dbo.Num SET n = 1-n WHERE n=1;

Related

SQL Server: time-series data performance

I have a table of a little over 1 billion rows of time-series data with fantastic insert performance but (sometimes) awful select performance.
Table tblTrendDetails (PK is ordered as shown):
PK TrendTime datetime
PK CavityId int
PK TrendValueId int
TrendValue real
The table is continuously pulling in new data and purging old data, so insert and delete performance needs to remain snappy.
When executing a query such as the following, performance is poor (30 sec):
SELECT *
FROM tblTrendDetails
WHERE TrendTime BETWEEN #inMinTime AND #inMaxTime
AND CavityId = #inCavityId
AND TrendValueId = #inTrendId
If I execute the same query again (with similar times, but any #inCavityId or #inTrendId), performance is very good (1 sec). Performance counters show that disk access is the culprit the first time the query is run.
Any recommendations regarding how to improve performance without (significantly) adversely affecting the insert or delete performance? Any suggestions (including completely changing the underlying database) are welcome.
The fact that subsequent queries of the same or similar data run much faster is probably due to SQL Server caching your data. That said, is it possible to speed this initial query up?
Verify the query plan:
My guess is that your query should result in an Index Seek rather than an Index Scan (or worse, a Table Scan). Please verify this using SET SHOWPLAN_TEXT ON; or a similar feature. Using between and = as your query does should really take advantage of the clustered index, though that's debatable.
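For example (a sketch with literal values standing in for the #in parameters; with SHOWPLAN_TEXT ON the query is not executed, only its estimated plan is returned):
SET SHOWPLAN_TEXT ON;
GO
SELECT *
FROM tblTrendDetails
WHERE TrendTime BETWEEN '20120101' AND '20120102'
  AND CavityId = 1
  AND TrendValueId = 1;
GO
SET SHOWPLAN_TEXT OFF;
GO
In the output, you want to see a Clustered Index Seek on the primary key rather than a Scan.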
Index Fragmentation:
It is possible that your clustered index (the primary key in this case) is quite fragmented after all of those inserts and deletes. I would probably check this with DBCC SHOWCONTIG (tblTrendDetails).
You can defrag the table's indexes with DBCC INDEXDEFRAG (MyDatabase, tblTrendDetails).
This may take some time, but will allow the table to remain accessible, and you can stop the operation without any nasty side-effects.
You might have to go further and use DBCC DBREINDEX (tblTrendDetails). This is an offline operation, though, so you should only do this when the table does not need to be accessed.
There are some differences described here: Microsoft SQL Server 2000 Index Defragmentation Best Practices.
Be aware that your transaction log can grow quite a bit from defragging a large table, and it can take a long time.
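Putting those commands together, a maintenance pass might look like the following sketch; the database name and index name are placeholders (DBCC INDEXDEFRAG also needs the index to target):
-- check fragmentation first
DBCC SHOWCONTIG ('tblTrendDetails');

-- online defrag: the table stays available and the operation can be stopped safely
DBCC INDEXDEFRAG (MyDatabase, 'tblTrendDetails', PK_tblTrendDetails);

-- heavier, offline rebuild: run only when the table does not need to be accessed
DBCC DBREINDEX ('tblTrendDetails');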
Partitioned Views:
If these do not remedy the situation (or fragmentation is not a problem), you may even wish to look to partitioned views, in which you create a bunch of underlying base tables for various ranges of records, then union them all up in a view (replacing your original table).
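A minimal sketch of the idea, with hypothetical member tables partitioned by month (names and ranges invented):
CREATE TABLE dbo.TrendDetails_201201 (
    TrendTime    datetime NOT NULL CHECK (TrendTime >= '20120101' AND TrendTime < '20120201'),
    CavityId     int      NOT NULL,
    TrendValueId int      NOT NULL,
    TrendValue   real     NULL,
    PRIMARY KEY (TrendTime, CavityId, TrendValueId)
);
CREATE TABLE dbo.TrendDetails_201202 (
    TrendTime    datetime NOT NULL CHECK (TrendTime >= '20120201' AND TrendTime < '20120301'),
    CavityId     int      NOT NULL,
    TrendValueId int      NOT NULL,
    TrendValue   real     NULL,
    PRIMARY KEY (TrendTime, CavityId, TrendValueId)
);
GO
CREATE VIEW dbo.vwTrendDetails
AS
SELECT TrendTime, CavityId, TrendValueId, TrendValue FROM dbo.TrendDetails_201201
UNION ALL
SELECT TrendTime, CavityId, TrendValueId, TrendValue FROM dbo.TrendDetails_201202;
Because of the CHECK constraints, a query with a TrendTime range only touches the member tables that can hold those dates, and purging old data becomes dropping the oldest table instead of deleting rows.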
Better Stuff:
If performance of these selects is a real business need, you may be able to make the case for better hardware: faster drives, more memory, etc. If your drives are twice as fast, then this query will run in half the time, yeah? Also, this may not be workable for you, but I've simply found newer versions of SQL Server to truly be faster with more options and better to maintain. I'm glad to have moved most of my company's data to 2008R2. But I digress...

What is the normal insert time in a medium size database on an indexed string column?

I have a sqlite database with only one table (around 50,000 rows) and I recurrently perform update-otherwise-insert operations on it using Java and sqlitejdbc (i.e. I try to update rows if they exist and insert new rows otherwise). My table is similar to a word frequency table with "word" and "frequency" columns, and without a primary key!
The problem is that I perform this update-otherwise-insert operation hundreds of thousands of times, and on average each insert or update takes more than 2 ms. There are even times when an insert takes some 20 milliseconds. I should also mention that the table has an index on the column I use in the "where" clause of these operations (the "word" column), which naturally makes them more expensive.
Firstly, I want to make sure that 2 ms for an insert into an indexed table of 50,000 rows is normal and there isn't anything I've missed; after that, any suggestions to improve the performance are more than welcome. It struck me that dropping the index before performing large batches of insert operations and recreating it afterwards is good practice, but I can't do that here because I need to check whether a row with the same word already exists.
I know all the stuff about "it depends on the hardware" and "it depends on the rest of your code" etc., but I really think one CAN have an idea of how long an insert operation should take on an average PC.
I partially solved my problem. For anyone interested in an answer to this, this link will be helpful. In short, turning off journal mode in sqlite ("pragma journal_mode=OFF") improves the insert performance significantly (almost four times the previous speed in my case), at the cost of making the code prone to data loss in case of an unexpected shutdown.
As for the normal insert speed, it is way faster than 2ms/operation. It can reach as high as hundreds of thousands of insert operations per second using the right pragma instructions, making best use of transactions, etc.
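In SQLite terms, the relevant pieces look roughly like this (the freq table stands in for the word/frequency table; note that journal_mode = OFF is exactly the durability trade-off mentioned above):
PRAGMA journal_mode = OFF;   -- no rollback journal: fast, but unsafe on unexpected shutdown
PRAGMA synchronous = OFF;    -- do not wait for each commit to reach the disk

BEGIN TRANSACTION;           -- batch many operations into a single commit
UPDATE freq SET frequency = frequency + 1 WHERE word = 'example';
INSERT INTO freq (word, frequency)
SELECT 'example', 1
WHERE changes() = 0;         -- changes() is 0 when the UPDATE matched no row
COMMIT;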

Sql Server 2008 R2 DC Inserts Performance Change

I have noticed an interesting performance change that happens around 1.5 million inserted values. Can someone give me a good explanation why this is happening?
The table is very simple. It consists of (bigint, bigint, bigint, bit, varbinary(max)).
I have a clustered primary key on the first three bigints. I insert only the boolean "true" as the varbinary(max) data.
From that point on, performance seems pretty constant.
[Graph omitted. Legend: Y = time in ms, X = inserts in batches of 10K.]
I am also curious about the constant, relatively small (sometimes very large) spikes on the graph.
[Screenshot omitted: the actual execution plan from before the spikes.]
Legend:
Table I am inserting into: TSMDataTable
1. BigInt DataNodeID - FK
2. BigInt TS - main timestamp
3. BigInt CTS - modification timestamp
4. Bit ICT - keeps record of the last inserted value (increases read performance)
5. VarBinary(max) Data - the payload (here, just the boolean "true")
Environment
It is local.
It is not sharing any resources.
It is a fixed-size database (large enough that it does not expand).
(Computer: 4 cores, 8 GB RAM, 7200 rpm disk, Win 7.)
(SQL Server 2008 R2 DC, processor affinity (cores 1, 2), 3 GB RAM.)
Have you checked the execution plan once the time goes up? The plan may change depending on statistics. Since your data grow fast, stats will change and that may trigger a different execution plan.
Nested loops are good for small amounts of data, but as you can see, the time grows with volume. The SQL query optimizer then probably switches to a hash or merge plan which is consistent for large volumes of data.
To confirm this theory quickly, try to disable statistics auto update and run your test again. You should not see the "bump" then.
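Something along these lines; the database name is a placeholder:
ALTER DATABASE TSMDB SET AUTO_UPDATE_STATISTICS OFF;
-- rerun the insert test here; if the bump disappears, statistics were the trigger
ALTER DATABASE TSMDB SET AUTO_UPDATE_STATISTICS ON;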
EDIT: Since Falcon confirmed that performance changed due to statistics we can work out the next steps.
I guess you do one-by-one inserts, correct? In that case (if you cannot bulk insert) you'll be much better off inserting into a heap work table, then, at regular intervals, moving the rows in bulk into the target table. This is because for each inserted row, SQL has to check for key duplicates and foreign keys, run other checks, and sort and split pages all the time. If you can afford to postpone these checks until a little later, you'll get superb insert performance, I think.
I used this method for metrics logging. Logging would go into a plain heap table with no indexes, no foreign keys, no checks. Every ten minutes, I create a new table of this kind, then with two "sp_rename"s within a transaction (swift swap) I make the full table available for processing and the new table takes the logging. Then you have the comfort of doing all the checking, sorting, splitting only once, in bulk.
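A sketch of that swap, with the logging table names invented for the example (the target table and columns are the ones from the question):
-- logging lands in dbo.MetricsLog, a plain heap: no indexes, no FKs, no checks
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.MetricsLog', 'MetricsLog_Work';   -- full table goes off for processing
    EXEC sp_rename 'dbo.MetricsLog_Next', 'MetricsLog';   -- pre-created empty heap takes over
COMMIT;

-- now pay for the checking, sorting and page splitting once, in bulk
INSERT INTO dbo.TSMDataTable (DataNodeID, TS, CTS, ICT, Data)
SELECT DataNodeID, TS, CTS, ICT, Data
FROM dbo.MetricsLog_Work;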
Apart from this, I'm not sure how to improve your situation. You certainly need to update statistics regularly, as that is key to good performance in general.
Might try using a single column identity clustered key and an additional unique index on those three columns, but I'm doubtful it would help much.
Might try padding the indexes if your inserted data are not sequential. This would eliminate excessive page splitting, shuffling, and fragmentation. You'll need to maintain the padding regularly, which may require off-hours maintenance.
Might try a HW upgrade. You'll need to figure out which component is the bottleneck. It may be the CPU or the disk - my favourite in this case. Memory is not likely, imho, if you have one-by-one inserts. It should be easy then: if it's not the CPU (the line hanging at the top of the graph), then it's most likely your IO holding you back. Try a better controller with a bigger cache, and faster disks...

SQL Server 2008 Performance: No Indexes vs Bad Indexes?

I'm running into a strange problem in Microsoft SQL Server 2008.
I have a large database (20 GB) with about 10 tables, and I'm attempting to make a point regarding how to correctly create indexes.
Here's my problem: on some nested queries I'm getting faster results without using indexes! It's close (one or two seconds), but in some cases using no indexes at all seems to make these queries run faster... I'm running a CHECKPOINT and a DBCC DROPCLEANBUFFERS to reset the caches before running the scripts, so I'm kinda lost.
What could be causing this?
I know for a fact that the indexes are poorly constructed (think one index per relevant field); the whole point is to prove the importance of constructing them correctly. But it should never be slower than having no indexes at all, right?
EDIT: here's one of the guilty queries:
SET STATISTICS TIME ON
SET STATISTICS IO ON
USE DBX;
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DBCC FREEPROCCACHE;
GO
SELECT * FROM Identifier where CarId in (SELECT CarID from Car where ManufactId = 14) and DataTypeId = 1
Identifier table:
- IdentifierId int not null
- CarId int not null
- DataTypeId int not null
- Alias nvarchar(300)
Car table:
- CarId int not null
- ManufactId int not null
- (several fields follow, all nvarchar(100))
Each of these bullet points has an index, along with some indexes that simultaneously store two of them at a time (e.g. CarId and DataTypeId).
Finally, the Identifier table has over a million entries, while the Car table has two or three million.
My guess would be that SQL Server is incorrectly deciding to use an index, which is then forcing a bookmark lookup*. Usually when this happens (the incorrect use of an index) it's because the statistics on the table are incorrect.
This can especially happen if you've just loaded large amounts of data into one or more of the tables. Or, it could be that SQL Server is just screwing up. It's pretty rare that this happens (I can count on one hand the times I've had to force index use over a 15 year career with SQL Server), but the optimizer is not perfect.
* A bookmark lookup is when SQL Server finds a row that it needs on an index, but then has to go to the actual data pages to retrieve additional columns that are not in the index. If your result set returns a lot of rows this can be costly and clustered index scans can result in better performance.
One way to get rid of bookmark lookups is to use covering indexes - an index which has the filtering columns first, but then also includes any other columns which you would need in the "covered" query. For example:
SELECT
my_string1,
my_string2
FROM
My_Table
WHERE
my_date > '2000-01-01'
A covering index would be (my_date, my_string1, my_string2).
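Declared in T-SQL, that could be either of the following; the index names are invented, and the INCLUDE form (SQL Server 2005 and later) keeps the strings out of the index key:
CREATE NONCLUSTERED INDEX IX_My_Table_covering
ON My_Table (my_date, my_string1, my_string2);

-- or, equivalently for this query:
CREATE NONCLUSTERED INDEX IX_My_Table_covering_incl
ON My_Table (my_date)
INCLUDE (my_string1, my_string2);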
Indexes don't really have any benefit until you have many records. I say many because I don't really know where that tipping point is... It depends on the specific application and circumstances.
It does take time for SQL Server to work with an index. If that time exceeds the benefit, the index makes things slower. This is especially true in subqueries, where a small difference is multiplied.
If it works better without the index, leave out the index.
Try DBCC FREEPROCCACHE to clear the execution plan cache as well.
This is an empty guess. Maybe if you have a lot of indexes, SQL Server is spending time analyzing and picking one, and then rejecting all of them. If you had no indexes, the engine wouldn't have to waste its time with this vetting process.
How long this vetting process actually takes, I have no idea.
For some queries, it is faster to read directly from the table (clustered index scan), than it is to read the index and fetch records from the table (index scan + bookmark lookup).
Consider that a record lives along with other records in a data page. The data page is the basic unit of IO. If the table is read directly, you could get 10 records for the cost of 1 IO. If the index is read first, and the records are then fetched from the table, you must pay 1 IO per record.
Generally SQL server is very good at picking the best way to access a table (direct vs index). There may be something in your query that is blinding the optimizer. Query hints can instruct the optimizer to use an index when it is wrong to do so. Join hints can alter the order or method of access of a table. Table Variables are considered to have 0 records by the optimizer, so if you have a large Table Variable - the optimizer may choose a bad plan.
One more thing to look out for - varchar vs nvarchar. Make sure all parameters are of the same type as the target columns. There's a case where SQL Server will convert the whole index to the parameter's type in the event of a type mismatch.
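A hypothetical illustration of the pitfall, assuming a dbo.Customers table whose LastName column is varchar(50):
DECLARE @p1 nvarchar(50);
SET @p1 = N'Smith';
SELECT CustomerId FROM dbo.Customers
WHERE LastName = @p1;    -- nvarchar parameter vs varchar column: implicit conversion, likely a scan

DECLARE @p2 varchar(50);
SET @p2 = 'Smith';
SELECT CustomerId FROM dbo.Customers
WHERE LastName = @p2;    -- matching types: a plain index seek is possible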
Normally SQL Server does a good job of deciding which index, if any, to use to retrieve the data in the fastest way. Quite often it will decide not to use any indexes, as it can retrieve small amounts of data from small tables quicker without going to the index at all.
It sounds like in your case SQL Server may not be taking the optimal route. Having lots of badly constructed indexes may be causing it to pick the wrong route to the data.
I would suggest viewing the query plan in Management Studio to check which indexes it's using and where the time is being taken. This should give you a good idea of where to start.
Another note: it may be that these indexes have become fragmented over time and are no longer performing at their best; it may be worth checking this and rebuilding some of them if needed.
Check the execution plan to see if it is using one of the indexes that you "know" to be bad.
Generally, indexing slows down writing data and can help to speed up reading data.
So yes, I agree with you. It should never be slower than having no indexes at all.
SQL Server actually creates some indexes for you (e.g. on the primary key).
Indexes can become fragmented.
Too many indexes will always reduce performance (there are FAQs on why not to index every column in the db).
Also, there are some situations where indexes will always be slower.
run:
SET SHOWPLAN_ALL ON
and then run your query with and without the index usage, this will let you see what index if any are being used, where the "work" is going on etc.
No. SQL Server analyzes both the indexes and the statistics before deciding whether to use an index to speed up a query. It is entirely possible that running a non-indexed version is faster than an indexed version.
A few things to try:
Ensure the indexes are created, rebuilt, and reorganized (defragmented).
Ensure that auto-create statistics is turned on.
Try using SQL Profiler to capture a tuning trace and then use the Database Engine Tuning Advisor to create your indexes.
Surprisingly, the MS Press examination book for SQL administration explains indexes and statistics pretty well.
See Chapter 4 table of contents in this amazon reader preview of the book
Amazon Reader of Sql 2008 MCTS Exam Book
To me it sounds like your SQL is written very poorly and thus does not utilize the indexes that you are creating.
You can add indexes till you're blue in the face, but if your queries aren't optimized to use those indexes then you won't get any performance gain.
Give us a sample of the queries you're using.
Alright...
Try this and see if you get any performance gains (with the PK indexes):
SELECT i.*
FROM Identifier i
inner join Car c
on i.CarID=c.CarID
where c.ManufactId = 14 and i.DataTypeId = 1

Why does SQL Server work faster when you index a table after filling it?

I have a sproc that puts 750K records into a temp table through a query as one of its first actions. If I create indexes on the temp table before filling it, the item takes about twice as long to run compared to when I index after filling the table. (The index is on a single integer column; the table being indexed is just two columns, each a single integer.)
This seems a little off to me, but then I don't have the firmest understanding of what goes on under the hood. Does anyone have an answer for this?
If you create a clustered index, it affects the way the data is physically ordered on the disk. It's better to add the index after the fact and let the database engine reorder the rows when it knows how the data is distributed.
For example, let's say you needed to build a brick wall with numbered bricks so that those with the highest number are at the bottom of the wall. It would be a difficult task if you were just handed the bricks in random order, one at a time - you wouldn't know which bricks were going to turn out to be the highest numbered, and you'd have to tear the wall down and rebuild it over and over. It would be a lot easier to handle that task if you had all the bricks lined up in front of you, and could organize your work.
That's how it is for the database engine - if you let it know about the whole job, it can be much more efficient than if you just feed it a row at a time.
It's because the database server has to do calculations each and every time you insert a new row. Basically, you end up reindexing the table each time. It doesn't seem like a very expensive operation, and it's not, but when you do that many of them together, you start to see the impact. That's why you usually want to index after you've populated your rows, since it will just be a one-time cost.
Think of it this way.
Given
unorderedList = {5, 1, 3}
orderedList = {1, 3, 5}
add 2 to both lists:
unorderedList = {5, 1, 3, 2}
orderedList = {1, 2, 3, 5}
What list do you think is easier to add to?
Btw, ordering your input before the load will give you a boost.
You should NEVER EVER create an index on an empty table if you are going to massively load it right afterwards.
Indexes have to be maintained as the data in the table changes, so imagine that for every insert on the table the index had to be recalculated (which is an expensive operation).
Load the table first and create the index after finishing the load.
That's where the performance difference is going.
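As a sketch, for the two-integer temp table described in the question (the source query is invented):
CREATE TABLE #work (KeyCol INT NOT NULL, ValCol INT NOT NULL);

INSERT INTO #work (KeyCol, ValCol)
SELECT KeyCol, ValCol
FROM dbo.BigSource;          -- heap insert: no index to maintain per row

CREATE CLUSTERED INDEX IX_work_KeyCol ON #work (KeyCol);   -- one sort, once, after the load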
After performing large data manipulation operations, you frequently have to refresh the statistics on the underlying indexes. You can do that by using the UPDATE STATISTICS [table] statement.
The other option is to drop and recreate the index which, if you are doing large data insertions, will likely perform the inserts much faster. You can even incorporate that into your stored procedure.
This is because, if the data you insert is not in index order, SQL will have to split pages to make room for the additional rows, to keep them together logically.
This is due to the fact that when SQL Server indexes a table that already contains data, it can produce exact statistics on the values in the indexed column. At some point SQL Server will recalculate the statistics, but when you perform massive inserts, the distribution of values may change after the statistics were last calculated.
The fact that statistics are out of date can be discovered in Query Analyzer: on a certain table scan, the expected number of rows differs too much from the actual number of rows processed.
You should use UPDATE STATISTICS to recalculate the distribution of values after you insert all the data. After that, no performance difference should be observed.
If you have an index on a table, as you add data to the table SQL Server will have to re-order the table to make room in the appropriate place for the new records. If you're adding a lot of data, it will have to reorder it over and over again. By creating an index only after the data is loaded, the re-order only needs to happen once.
Of course, if you are importing the records in index order it shouldn't matter so much.
In addition to the index overhead, running each insert as its own transaction is a bad idea for the same reason. If you run chunks of inserts (say 100) within one explicit transaction, you should also see a performance increase.
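For instance (dbo.Target is a placeholder), wrapping a hundred singleton inserts in one explicit transaction flushes the log once per batch instead of once per row:
DECLARE @i INT;
SET @i = 0;
BEGIN TRANSACTION;
WHILE @i < 100
BEGIN
    INSERT INTO dbo.Target (n) VALUES (@i);
    SET @i = @i + 1;
END;
COMMIT;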
