SQL Server - can a fragmented index be worse than no index?

If my index is say 80% fragmented and is used in joins can the overall performance be worse than if that index didn't exist? And if so, why?

Your question is too vague to answer consistently, or even to know what you're actually after, but consider this:
A fragmented index means you'll have far more actual disk activity than the amount of disk activity a given query would otherwise need.
Take a look at DBCC SHOWCONTIG.
Among other useful information, it shows you a figure for Scan Density. A very low "hit rate" on this can imply that you're doing heaps more IO than you would need with a properly maintained index. This could even exceed the amount of IO you'd need to perform a table scan, but it all depends on the size of your objects and your data access pattern.
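For example, a minimal check against a hypothetical dbo.Orders table might look like this (the table name is purely illustrative):
-- Report fragmentation for one table and all of its indexes
DBCC SHOWCONTIG ('dbo.Orders') WITH ALL_INDEXES;
GO
-- In the output, look at "Scan Density [Best Count:Actual Count]";
-- values well below 100% suggest heavy fragmentation.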
One area where a poorly maintained (= highly fragmented) index will hurt you double is that it hurts the performance of inserts, updates AND selects.
With this in mind, it's a pretty common practice for ETL processes to drop indexes before processing large batches of information and recreate them afterwards. In the meantime, the indexes would only hurt write performance and be too fragmented to help lookups.
Besides that: it's easy to do index maintenance. I'd recommend deploying Ola Hallengren's index maintenance solution and then no longer worrying about it.
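If you'd rather script it yourself, a rough sketch (again with an illustrative dbo.Orders table and index name) is to read sys.dm_db_index_physical_stats and then reorganize or rebuild based on the fragmentation level:
-- How fragmented is each index on this table?
SELECT i.name, s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i ON i.object_id = s.object_id AND i.index_id = s.index_id;
GO
-- Common rule of thumb: REORGANIZE for moderate fragmentation, REBUILD for heavy fragmentation
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REORGANIZE;  -- roughly 5-30% fragmented
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD;     -- above roughly 30%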

Related

What are some best practices and "rules of thumb" for creating database indexes?

I have an app that cycles through a huge number of records in a database table and performs a number of SQL and .NET operations on records within that database (currently I am using Castle.ActiveRecord on PostgreSQL).
I added some basic B-tree indexes on a couple of the fields and, as you would expect, the performance of the SQL operations increased substantially. Wanting to make the most of the DBMS's performance, I want to make better-educated choices about what I should index on all my projects.
I understand that there is a detriment to performance when doing inserts (as the database needs to update the index as well as the data), but what suggestions and best practices should I consider when creating database indexes? How do I best select the fields or combinations of fields for a set of database indexes (rules of thumb)?
Also, how do I best select which index to use as a clustered index? And when it comes to the access method, under what conditions should I use a B-tree over a hash or a GiST or a GIN index (and what are they, anyway)?
Some of my rules of thumb:
Index ALL primary keys (I think most RDBMS do this when the table is created).
Index ALL foreign key columns.
Create more indexes ONLY if:
Queries are slow.
You know the data volume is going to increase significantly.
Run statistics when populating a lot of data in tables.
If a query is slow, look at the execution plan and:
If a query against a table only uses a few columns, put all of those columns into an index; that way you can help the RDBMS to use only the index.
Don't waste resources indexing tiny tables (hundreds of records).
Index multiple columns in order from high cardinality to low. This means: put the columns with more distinct values first, followed by the columns with fewer distinct values (see the sketch after this list).
If a query needs to access more than 10% of the data, a full scan is normally better than an index.
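As a sketch of the cardinality rule above (the table and column names are made up):
-- customer_id has many distinct values, status only a few,
-- so the high-cardinality column comes first
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);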
Here's a slightly simplistic overview: it's certainly true that there is an overhead to data modifications due to the presence of indexes, but you ought to consider the relative number of reads and writes to the data. In general the number of reads is far higher than the number of writes, and you should take that into account when defining an indexing strategy.
When it comes to which columns to index, I've always felt that the designer ought to know the business well enough to take a very good first pass at which columns are likely to benefit. Other than that, it really comes down to feedback from the programmers, full-scale testing, and system monitoring (preferably with extensive internal metrics on performance to capture long-running operations).
As @David Aldridge mentioned, the majority of databases perform many more reads than writes, and in addition appropriate indexes will often be utilised even when performing INSERTs (to determine the correct place to insert).
The critical indexes under an unknown production workload are often hard to guess/estimate, and a set of indexes should not be viewed as set once and forget. Indexes should be monitored and altered with changing workloads (that new killer report, for instance).
Nothing beats profiling; if you guess your indexes, you will often miss the really important ones.
As a general rule, if I have little idea how the database will be queried, I will create indexes on all foreign keys, profile under a workload (think UAT release), then remove the indexes that are not being used and create any important missing ones.
Also, make sure that a scheduled index maintenance plan is created.

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0 GB), I get 60 to 70 writes per millisecond. After, say, 5 GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based on the queries - heavy fetch and light insert? Index everywhere a WHERE clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload? Benchmark it!
When benchmarking, you want data that is as real or realistic as possible, both in volume and in data domain (a realistic distribution of values - not rows that all say "henry smith", but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to an extreme with tables (and I've seen these in the wild) that index every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumptions about your usage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when the update is complete.
The type of index applied also influences insertion performance - for example, with a SQL Server clustered index the same I/O that places the data also maintains the index, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (data skew, page distribution, tearing, etc.).
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.

SQL Server 2005 - Rowsize effect on query performance?

I'm trying to squeeze some extra performance from searching through a table with many rows.
My current reasoning is that if I can throw away some of the seldom-used columns from the searched table, thereby reducing the row size, the number of page splits and hence the IO should drop, giving a benefit when the data starts to spill out of memory.
Any good resource detailing such effects?
Any experiences?
Thanks.
Tuning the size of a row is only a major issue if the RDBMS is performing a full table scan; if your query can select the rows using only indexes, then the row size is less important (unless you are returning a very large number of rows, where the IO of returning the actual result is significant).
If you are doing a full table scan, or partial scans of large numbers of rows because you have predicates that are not using indexes, then row size can be a major factor. One example I remember: on a table on the order of 100,000,000 rows, splitting the largish 'data' columns into a different table from the columns used for querying resulted in an order-of-magnitude performance improvement on some queries.
I would only expect this to be a major factor in a relatively small number of situations.
I don't know what else you have tried to increase performance; this seems like grasping at straws to me. That doesn't mean that it isn't a valid approach. From my experience the benefit can be significant; it's just that it's usually dwarfed by other kinds of optimization.
However, what you are looking for are IO statistics. There are several methods to gather them; a quite good introduction can be found here.
The SQL Server query plan optimizer is a very complex algorithm, and the decision of which index to use or what type of scan to perform depends on many factors, such as the query's output columns, the indexes available, the statistics available, the statistical distribution of your data values in the columns, row count, and row size.
So the only valid answer to your question is: It depends :)
Give some more information, like what kind of optimization you have already done, what the query plan looks like, etc.
Of course, when SQL Server decides to do a table scan (or a clustered index scan if one is available), you can reduce IO by downsizing the row size. But in that case you would increase performance far more dramatically by creating an adequate index (which is, de facto, a separate table with a smaller row size).
If the application is transactional then look at the indexes in use on the table. Table partitioning is unlikely to be much help in this situation.
If you have something like a data warehouse and are doing aggregate queries over a lot of data then you might get some mileage from partitioning.
If you are doing a join between two large tables that are not in a 1:M relationship the query optimiser may have to resolve the predicates on each table separately and then combine relatively large intermediate result sets or run a slow operator like nested loops matching one side of the join. In this case you may get a benefit from a trigger-maintained denormalised table to do the searches. I've seen good results obtained from denormalised search tables for complex screens on a couple of large applications.
If you're interested in minimizing IO when reading data, you need to check whether your indexes cover the query. To minimize IO, you should select columns that are included in an index, or use indexes that cover all columns used in the query; this way the optimizer will read data from the indexes and never touch the actual table rows.
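For instance, here is a sketch of a covering index in SQL Server (2005 onwards) on a hypothetical dbo.Orders table - the key column supports the WHERE clause, and the INCLUDE columns let the whole query be answered from the index:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, TotalAmount);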
If you're looking into this kind of detail, maybe you should also consider upgrading hardware, changing controllers, or adding more disks to have more spindles available for the query processor, allowing SQL Server to read more data at the same time.
SQL Server disk I/O is frequently the cause of bottlenecks in most systems. The I/O subsystem includes disks, disk controller cards, and the system bus. If disk I/O is consistently high, consider:
Move some database files to an additional disk or server.
Use a faster disk drive or a redundant array of inexpensive disks (RAID) device.
Add additional disks to a RAID array, if one already is being used.
Tune your application or database to reduce disk access operations.
Consider index coverage, better indexes, and/or normalization.
Microsoft SQL Server uses Microsoft Windows I/O calls to perform disk reads and writes. SQL Server manages when and how disk I/O is performed, but the Windows operating system performs the underlying I/O operations. Applications and systems that are I/O-bound may keep the disk constantly active.
Different disk controllers and drivers use different amounts of CPU time to perform disk I/O. Efficient controllers and drivers use less time, leaving more processing time available for user applications and increasing overall throughput.
The first thing I would do is ensure that your indexes have been rebuilt; if you are dealing with a huge amount of data and an index rebuild is not possible (from SQL Server 2005 onwards you can perform online rebuilds without locking everyone out), then ensure that your statistics are up to date (more on this later).
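As a rough sketch (the table and index names are illustrative, and ONLINE rebuilds require Enterprise Edition):
-- Rebuild without taking the index offline (SQL Server 2005 onwards, Enterprise Edition)
ALTER INDEX IX_BigTable_SearchCol ON dbo.BigTable REBUILD WITH (ONLINE = ON);
GO
-- If a rebuild is not feasible, at least bring the statistics up to date
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
GO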
If your database contains representative data, then you can perform a simple measurement of the number of reads (logical and physical) that your query is using by doing the following:
SET STATISTICS IO ON
GO
-- Execute your query here
SET STATISTICS IO OFF
GO
On a well set up database server there should be few or no physical reads (a high number of physical reads often indicates that your server needs more RAM). How many logical reads are you doing? If this number is high, then you will need to look at creating indexes. The next step is to run the query with the estimated execution plan turned on, then rerun it (clearing the cache first) displaying the actual execution plan. If these differ, then your statistics are out of date.
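Clearing the cache before the second run might look like this - a sketch for test or development servers only, never production:
CHECKPOINT;               -- flush dirty pages so the next command is effective
DBCC DROPCLEANBUFFERS;    -- empty the buffer (data) cache
DBCC FREEPROCCACHE;       -- empty the plan cache
GO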
I think you're going to be farther ahead using standard optimization techniques first -- check your execution plan, profiler trace, etc. and see whether you need to adjust your indexes, create statistics etc. -- before looking at the physical structure of your table.

Indexing a large table in SQL Server

I have a large table (more than 10 million records). This table is heavily used for searches in the application, so I had to create indexes on it. However, I experience slow performance when a record is inserted or updated in the table. This is most likely because of the recalculation of the indexes.
Is there a way to improve this?
Thanks in advance
You could try reducing the number of page splits (in your indexes) by reducing the fill factor on your indexes. By default, the fill factor is 0 (same as 100 percent). So, when you rebuild your indexes, the pages are completely filled. This works great for tables that are not modified (insert/update/delete). However, when data is modified, the indexes need to change. With a Fill Factor of 0, you are guaranteed to get page splits.
By modifying the fill factor, you should get better performance on inserts and updates because the page won't ALWAYS have to split. I recommend you rebuild your indexes with a Fill Factor = 90. This will leave 10% of the page empty, which would cause less page splits and therefore less I/O. It's possible that 90 is not the optimal value to use, so there may be some 'trial and error' involved here.
By using a different value for fill factor, your select queries may become slightly slower, but with a 90% fill factor, you probably won't notice it too much.
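A minimal sketch, assuming an illustrative dbo.BigTable:
-- Leave 10% free space in each leaf page to absorb future inserts and updates
ALTER INDEX ALL ON dbo.BigTable REBUILD WITH (FILLFACTOR = 90);
GO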
There are a number of solutions you could choose from:
a) You could partition the table.
b) Consider performing updates in batches at off-peak hours (like at night).
c) Since engineering is a balancing act of trade-offs, you have to decide which operation is more important (SELECT or UPDATE/INSERT/DELETE). Assuming you don't need the results of an insert in real time, you can use SQL Server Service Broker to perform the "less important" operation asynchronously.
http://msdn.microsoft.com/en-us/library/ms166043(SQL.90).aspx
Thanks
-RVZ
We'd need to see your indexes, but likely yes.
Some things to keep in mind are that you don't want to just put an index on every column, and you don't generally want just one column per index.
The best thing to do is if you can log some actual searches, track what users are actually searching for, and create indexes targeted at those specific types of searches.
This is a classic engineering trade-off... you can make the shovel lighter or stronger, never both... (until a breakthrough in materials science).
More indexes mean more DML maintenance but faster queries.
Fewer indexes mean less DML maintenance but slower queries.
It could be possible that some of your indexes are redundant and could be combined.
Besides what Joel wrote, you also need to define the SLA for DML. Is it OK that it's slower? You noticed that it slowed down, but does it really matter compared with the query improvement you've achieved? In other words, is it OK to have a light shovel that is that weak?
If you have a clustered index that is not on an identity numeric field, rearranging the pages can slow you down. If this is the case for you, see if you can improve speed by making it a non-clustered index (faster than no index, but it tends to be a bit slower than a clustered index, so your selects may slow down while inserts improve).
I have found users more willing to tolerate a slightly slower insert or update than a slower select. That is a general rule, but if it becomes unacceptably slow (or worse, times out), no one can deal with that well.

How many database indexes is too many?

I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there's lots of SELECTs and very few changes, index all you like.... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs + DELETEs ... these will be very slow with lots of indexes since they all need to be modified each time one of these operations takes place
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with any optimization, I stop when the requested performance is reached (this obviously implies that step zero would be getting specific performance requirements).
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, even the best planned indexing strategy can still create indexes that don't end up getting used. One strategy that lets you find indexes that aren't used is to monitor index usage. You do this as follows:
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
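For example, after enabling monitoring as above, a query along these lines shows whether the index has been touched (the index name is illustrative):
SELECT index_name, monitoring, used, start_monitoring
FROM   v$object_usage
WHERE  index_name = 'MY_INDEX_NAME';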
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this, it must be understood in the context that in a data warehouse we generally insert a row only once and never update it, but that row can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes later if the queries against the table would benefit.
In a paraphrase of Einstein about simplicity, add as many indexes as you need and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and real MySql database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real project and real MySQL database.
My results are: adding an average index (1-3 columns in an index) to a table makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it OK to add many indexes? It depends :) I gave you my results - you decide!
Ultimately how many indexes you need depend on the behavior of your applications that ride on top of your database server.
In general, the more inserting you do, the more painful your indexes become. Each time you do an insert, all the indexes on that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the opposite, in that queries are done over unrestricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reorganizing your DB to include calculated search fields, splitting tables, etc. - it's really dependent on your load shapes and query parameters, and on how much and what data "really" needs to be returned by a query.
If your entire DB is fronted by stored-procedure facades, tuning becomes a bit easier, as you don't have to worry about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning Advisor useful - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'official' or third party.
This really is more a theoretical question than a practical one. The impact of indexes on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard that Oracle announced a dedicated storage appliance, made by HP, which is supposed to perform 10 times faster with an 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that applies to your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to an IOT (index-organized table), if your data supports this.
Of course there might be many more solutions for such a case. My first suggestion to you would be to clone the DB to a development environment, and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to speed up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
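A sketch with made-up names - one index serving all three of those search patterns:
CREATE INDEX ix_search ON my_table (column1, column2, column3);
-- The same index can be used for:
--   WHERE column1 = :a
--   WHERE column1 = :a AND column2 = :b
--   WHERE column1 = :a AND column2 = :b AND column3 = :c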
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the number of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criteria will cost you more to rebuild than the performance penalty of not having it in the odd case it got used.
SQL Server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
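As a sketch of the kind of query that article is about (not its exact code), SQL Server's sys.dm_db_index_usage_stats DMV lets you compare reads against writes per index:
-- Reads (seeks + scans + lookups) vs. writes (updates) per index in the current database
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID();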
It totally depends on the columns being used in the WHERE clause.
And as a rule of thumb, we must have indexes on foreign key columns to avoid deadlocks.
AWR reports should be analyzed periodically to understand the need for indexes.
