Suppose I have a database table with two fields, "foo" and "bar". Neither of them is unique, but each of them is indexed. However, rather than being indexed together, they each have a separate index.
Now suppose I perform a query such as SELECT * FROM sometable WHERE foo='hello' AND bar='world'; My table has a huge number of rows for which foo is 'hello' and a small number of rows for which bar is 'world'.
So the most efficient thing for the database server to do under the hood is to use the bar index to find all rows where bar is 'world', then return only those rows for which foo is 'hello'. This is O(n) where n is the number of rows where bar is 'world'.
However, I imagine it's possible that the process would happen in reverse, where the foo index was used and the results then searched. This would be O(m) where m is the number of rows where foo is 'hello'.
So is Oracle smart enough to search efficiently here? What about other databases? Or is there some way I can tell it in my query to search in the proper order? Perhaps by putting bar='world' first in the WHERE clause?
Oracle will almost certainly use the most selective index to drive the query, and you can check that with the explain plan.
Furthermore, Oracle can combine the use of both indexes in a couple of ways -- it can convert btree indexes to bitmaps and perform a bitmap AND operation on them, or it can perform a hash join on the rowids returned by the two indexes.
One important consideration here might be any correlation between the values being queried. If foo='hello' accounts for 80% of values in the table and bar='world' accounts for 10%, then Oracle is going to estimate that the query will return 0.8 * 0.1 = 8% of the table rows. However this may not be correct - the query may actually return 10% of the rows or even 0% of the rows, depending on how correlated the values are. Now, depending on the distribution of those rows throughout the table, it may not be efficient to use an index to find them. You may still need to access (say) 70% of the table blocks to retrieve the required rows (google for "clustering factor"), in which case Oracle is going to perform a full table scan if it gets the estimate right.
In 11g you can collect multicolumn statistics to help with this situation I believe. In 9i and 10g you can use dynamic sampling to get a very good estimation of the number of rows to be retrieved.
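As a rough sketch of the 11g approach (table and column names are the ones from the question; the exact DBMS_STATS parameters are an assumption you should verify for your version), creating a column group and regathering statistics looks something like this:
select dbms_stats.create_extended_stats(user, 'SOMETABLE', '(FOO,BAR)') from dual
/
begin
  -- regather so the new column group gets statistics of its own
  dbms_stats.gather_table_stats(user, 'SOMETABLE', method_opt => 'for all columns size auto');
end;
/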
To get the execution plan do this:
explain plan for
SELECT *
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Contrast that with:
explain plan for
SELECT /*+ dynamic_sampling(4) */
*
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Eli,
In a comment you wrote:
Unfortunately, I have a table with lots of columns each with their own index. Users can query any combination of fields, so I can't efficiently create indexes on each field combination. But if I did only have two fields needing indexes, I'd completely agree with your suggestion to use two indexes. – Eli Courtwright (Sep 29 at 15:51)
This is actually rather crucial information. Sometimes programmers outsmart themselves when asking questions. They try to distill the question down to its essential points but quite often oversimplify and miss getting the best answer.
This scenario is precisely why bitmap indexes were invented -- to handle the times when unknown groups of columns would be used in a where clause.
Just in case someone says that BMIs are for low-cardinality columns only and may not apply to your case: "low" is probably not as small as you think. The only real issue is concurrent DML against the table - it must be single-threaded or rare for this to work.
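For illustration only (the index name is made up), a bitmap index is created like any other index, just with the BITMAP keyword:
CREATE BITMAP INDEX bix_sometable_foo ON sometable(foo)
/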
Yes, you can give "hints" with the query to Oracle. These hints are disguised as comments ("/*+ HINT */") to the database and are mainly vendor specific. So one hint for one database will not work on another database.
I would use index hints here, the first hint for the small table. See here.
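As a hedged sketch of what an Oracle index hint could look like (the index name ix_sometable_bar is an assumption, not something from the question):
SELECT /*+ INDEX(t ix_sometable_bar) */ *
FROM sometable t
WHERE foo='hello' AND bar='world'
/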
On the other hand, if you often search over these two fields, why not create an index on these two? I do not have the right syntax, but it would be something like
CREATE INDEX IX_BAR_AND_FOO on sometable(bar,foo);
This way data retrieval should be pretty fast. And in case the concatenation is unique, then you simply create a unique index, which should be lightning fast.
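For instance, assuming the combination really is unique, something like:
CREATE UNIQUE INDEX UX_BAR_AND_FOO ON sometable(bar, foo);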
First off, I'll assume that you are talking about nice, normal, standard b*-tree indexes. The answer for bitmap indexes is radically different. And there are lots of options for various types of indexes in Oracle that may or may not change the answer.
At a minimum, if the optimizer is able to determine the selectivity of a particular condition, it will use the more selective index (i.e. the index on bar). But if you have skewed data (there are N values in the column bar but the selectivity of any particular value is substantially more or less than 1/N of the data), you would need to have a histogram on the column in order to tell the optimizer which values are more or less likely. And if you are using bind variables (as all good OLTP developers should), depending on the Oracle version, you may have issues with bind variable peeking.
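As a sketch of how you might ask for such a histogram (table and column names assumed; 254 buckets is the traditional maximum for older Oracle versions):
begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'SOMETABLE',
    method_opt => 'for columns bar size 254');  -- build a histogram on bar
end;
/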
Potentially, Oracle could even do an on the fly conversion of the two b*-tree indexes to bitmaps and combine the bitmaps in order to use both indexes to find the rows it needs to retrieve. But this is a rather unusual query plan, particularly if there are only two columns where one column is highly selective.
So is Oracle smart enough to search efficiently here?
The simple answer is "probably". There are lots of very bright people at each of the database vendors working on optimizing the query optimizer, so it's probably doing things that you haven't even thought of. And if you update the statistics, it'll probably do even more.
I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.
The best approach would be to add foo to bar's index, or add bar to foo's index (or both). If foo's index also includes bar as a second column, that additional level of indexing will not affect the utility of the foo index for any of its current uses, nor will it appreciably affect the cost of maintaining that index, but it will give the database additional information to work with when optimizing queries such as the one in the example.
It's better than that.
Index seeks are usually quicker than full table scans. So behind the scenes Oracle (and SQL Server for that matter) will first locate the range of rows on both indices. It will then look at which range is shorter (the intersection behaves like an inner join on the row identifiers), and it will iterate over the shorter range to find the matches in the larger of the two.
You can provide hints as to which index to use. I'm not familiar with Oracle, but in MySQL you can use USE INDEX, IGNORE INDEX or FORCE INDEX (see here for more details). For best performance, though, you should use a combined index.
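A minimal sketch of the MySQL syntax (the index name idx_bar is assumed):
SELECT * FROM sometable USE INDEX (idx_bar)
WHERE foo='hello' AND bar='world';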
Related
If I have a large table with:
varchar foo
integer foo_id
integer other_id
varchar other_field
And I might be doing queries like:
select * from table where other_id=x
obviously I need an index on other_id to avoid a table scan.
If I'm also doing:
select * from table where other_id=x and other_field='y'
Do I want another index on other_field or is that a waste if I never do:
select * from table where other_field='y'
i.e. I only use other_field with other_id together in a query.
Would a compound index of both [other_id, other_field] be better? Or would that cause a table scan for the 1st simple query?
Use EXPLAIN and EXPLAIN ANALYZE, if you are not using these two already. Once you understand query plan basics you'll be able to optimize database queries pretty effectively.
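For example (assuming a table literally named sometable and placeholder values), it is as simple as prefixing the query:
EXPLAIN ANALYZE
SELECT * FROM sometable WHERE other_id = 42 AND other_field = 'y';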
Now to the question - saying anything without knowing a bit about the values might be misleading. If there are not that many other_field values for any specific other_id, then a simple index other_id would be enough. If there are many other_field values (i.e. thousands), I would consider making the compound index.
Do I want another index on other_field or is that a waste if I never do:
Yes, that would very probably be a waste of space. Postgres is able to combine two indexes, but the conditions must be just right for that.
Would a compound index of both [other_id, other_field] be better?
Might be.
Or would that cause a table scan for the 1st simple query?
Postgres can use a multi-column index effectively only via its first column (not exactly true - check the answer comments).
The basic rule is - get a real data set, prepare queries you are trying to optimize. Run EXPLAIN ANALYZE on those queries. Try to rewrite them (i.e. joins instead of subselects or vice versa) and check the performance (EXPLAIN ANALYZE). Try to add indexes where you feel it might help and check the performance (EXPLAIN ANALYZE)... if it does not help, don't forget to drop the unnecessary index.
And if you are still having problems and your data set is big (tens of millions+), you might need to reconsider even running specific queries. A different approach might be needed (e.g. batch / async processing) or a different technology for the specific task.
If other_id is highly selective, then you might not need an index on other_field at all. If only a few rows match other_id=x in the index, looking at each of them to see if they also match other_field=y might be fast enough to not bother with more indexes.
If it turns out that you do need to make the query faster, then you almost surely want the compound index. The stand alone index on other_field is unlikely to help.
The accepted answer is not entirely accurate - if you need all three queries mentioned in your question, then you'll actually need two indexes.
Let's see which indexes satisfy which WHERE clause in your queries:
WHERE clause                     | {other_id} | {other_id, other_field} | {other_field, other_id} | {other_field}
other_id=x                       | yes        | yes                     | no                      | no
other_id=x and other_field='y'   | partially  | yes                     | yes                     | partially
other_field='y'                  | no         | no                      | yes                     | yes
So to satisfy all 3 WHERE clauses, you'll need:
either an index on {other_id} and a composite index on {other_field, other_id}
or an index on {other_field} and a composite index on {other_id, other_field}
or a composite index on {other_id, other_field} and a composite index on {other_field, other_id}.1
Depending on distribution of your data, you could also get away with {other_id} and {other_field}, but you should measure carefully before opting for that solution. Also, you may consider replacing * with a narrower set of fields and then covering them by indexes, but that's a whole other topic...
1 "Fatter" solution than the other two - consider only if you have specific covering needs.
I have a table in a database that will be generated from the start and probably never be written to again. Even if it were ever written to, it'll be in the form of batch processes run during a release, and write time is not important at all.
It's a relatively large table with about 80k rows and maybe about 10-12 columns.
The application is likely to retrieve data from this table often.
I was thinking, since it'll never be written to again, should I just put indices on all the columns? That way it'll always be quick to read no matter what type of query I form?
Is this a good idea? Is there any downside to this I should be aware of?
My understanding is that each index does require some (a relatively small amount of) storage space. If you're tight for space this could matter. Exactly how much impact this might make may depend on which DB you are using.
It will depend on the table. If all of the columns will be used in search criteria, then it is not unreasonable to put indexes on them all. That is fairly unlikely though. Also, there may be compound (multi-column) indexes that would be more beneficial than some of the simple (single-column) indexes.
Finally, the query optimizer will have to review all the indexes that are present on the table when evaluating how answer queries. It is hard to say when this becomes a measurable overhead, but more indexes takes more time.
So, given the static nature of the table you describe, it is reasonable to index it more heavily than you might a more dynamic table. Indexing every column is probably not sensible. Choosing carefully which compound indexes to add may be important too.
Choose indexes for a table based on the queries you run against that table.
Indexes you never need for any query are just wasted space.
Individual indexes on each column isn't the full set of indexes possible. You also can make multi-column indexes (i.e. compound indexes), and these can be important for optimizing certain queries. The order of columns in a compound index matters.
SQL Server 2008 supports only 999 nonclustered indexes per table, so if you try to create all possible indexes on a table of more than a few columns, you will reach the limit.
Sorry, but you actually need to learn some things before you can optimize effectively. If it were simply a matter of indexing every column, then the RDBMS would do this by default.
What I mean is: Does a table with 20 columns benefit more from indexing a certain field (one that's used in search-ish queries) than a table that has just 4 columns?
Also: What is the harm in adding index to fields that I don't search with much, but might later in the future? Is there a negative to adding indexes? Is it just the size it takes up on disk, or can it make things run slower to add unnecessary indexes?
extracted from a comment
I'm using Postgres (latest version) and I have one table that I'll be doing a lot of LIKE-type queries on, but the values will undoubtedly change often since my clients have CRUD access. Should I can the idea of indexes? Are they just a headache?
Does a table with 20 columns benefit more from indexing a certain field (one that's used in search-ish queries) than a table that has just 4 columns?
No, number of columns in a table has no bearing on benefits from having an index.
An index is solely on the values in the column(s) specified; it's the frequency of the values that will impact how much benefit your queries will see. For example, a column containing a boolean value is a poor choice for indexing, because there's a 50/50 chance the value will be one or the other. At a 50/50 split over all the rows, the index doesn't narrow the search for a particular row.
What is the harm in adding index to fields that I don't search with much, but might later in the future?
Indexes only speed up data retrieval when they can be used, but they negatively impact the speed of INSERT/UPDATE/DELETE statements. Indexes also require maintenance to keep their value.
If you are doing LIKE queries you may find that indexes are not much help anyway. While an index might improve this query ...
select * from t23
where whatever like 'SOMETHING%'
/
... it is unlikely that an index will help with either of these queries ...
select * from t23
where whatever like '%SOMETHING%'
/
select * from t23
where whatever like '%SOMETHING'
/
If you have free text fields and your users need fuzzy matching then you should look at Postgres's full text search functionality. This uses the @@ match operator on tsvector/tsquery values rather than LIKE, and requires a special index type. Find out more.
There is a gotcha, which is that full text indexes are more complicated than normal ones, and the related design decisions are not simple. Also some implementations require additional maintenance activities.
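As a minimal sketch (t23 and whatever are the example names from above; the 'english' configuration is an assumption):
CREATE INDEX t23_whatever_fts ON t23 USING gin (to_tsvector('english', whatever));

SELECT * FROM t23
WHERE to_tsvector('english', whatever) @@ to_tsquery('english', 'something');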
Background: I have a table with 5 million address entries which I'd like to search for different fields (customer name, contact name, zip, city, phone, ...), up to 8 fields. The data is pretty stable, maximum 50 changes a day, so almost only read access.
The user isn't supposed to tell me in advance what he's searching for, and I also want support of combined search (AND-concatenation of search terms). For example "lincoln+lond" should search for all records containing both search terms in any of the search fields, also those entries starting with any of the terms (like "London" in this example).
Problem: Now I need to choose an indexing strategy for this search table. (As a side note: I'm trying to achieve sub-second response times; the worst response time should be 2 seconds.) What's better in terms of performance:
Do a combined index out of all queryable columns (would need 2 of them, as the 900-byte index limit would be reached)
Put single indexes on each of the queryable columns
Make a fulltext index on the queryable columns and use fulltext query
I'm discarding point 1, as it doesn't seem to have any advantage (index usage will be limited and there will be no "index seek", because not all fields fit in one single index).
Question: Now, should I use the multiple single indexes variant or should I go with the fulltext index? Is there any other way to achieve the functionality mentioned above?
Try them both and see which is faster on your system. There are few hard and fast rules for database optimizations, it really depends on your environment.
Originally, I was about to suggest going with FTS as that has a lot of strong performance features going for it, especially when you're dealing with varied queries (e.g. x AND y, x NEAR y, etc.).
But before I start to ramble on with the pros of FTS, I just checked your server version -> SQL Server 2000.
Poor thing. FTS was very simple back then, so stick with multiple single indexes.
We use Sql2008 and ... it rocks.
Oh, btw. did you know that Sql2008 (free edition) has FTS in it? Is it possible to upgrade?
Going from sql2000 -> sql2008 is very worth it, if you can.
But yeah, stick with your M.S.I. option.
I agree with Grauenwolf, and I'd like to add a note about indexes. Keep in mind that if you use a syntax like the following:
SELECT field1, field2, field3
FROM table
WHERE field1 LIKE '%value%'
Then no index will be used anyway when searching on field1 and you have to resort to a full-text index. For the sake of completeness, the above syntax returns all rows where field1 contains value (not necessarily at the beginning).
If you have to search for "contains", a full-text index is probably more appropriate.
To answer my own question:
I've chosen the "multiple single indexes" option. I ended up having an index for each of the queried columns, each index containing only the column itself. The search works very well, with mostly sub-second response times. Sometimes it takes up to 2-3 seconds, but I'm attributing that to my database server (a several-year-old laptop with 3GB RAM and a slow disk).
I didn't test the fulltext option as it was no longer necessary (and I don't have the time to do it).
I've read that columns that are chosen for indices should discriminate well among the rows, i.e. index columns should not contain a large number of rows with the same value. This would suggest that booleans or an enum such as gender would be a bad choice for an index.
But say I want to find users by gender and in my particular database, only 2% of the users are female, then in that case it seems like the gender column would be a useful index when getting the female users, but not when getting all the male users.
So would it generally be a good idea to put an index on such a column?
Indexing a low-cardinality column to improve search performance is common in my world. Oracle supports a "bitmapped index" which is designed for these situations. See this article for a short overview.
Most of my experience is with Oracle, but I assume that other RDBMSs support something similar.
Don't forget, though, that you'll probably only be selecting for females about 2% of the time. The rest of the time, you'll be searching for males. And for that, a straight table scan (rather than an index scan plus accessing the data from the table) is going to be quicker.
You can also, sometimes, use a compound index, with a low cardinality column (enum, boolean) coupled with a higher cardinality column (birth date, perhaps). This depends very much on the full data, and the queries you'll really use.
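For instance (table and column names are purely illustrative), something like:
CREATE INDEX ix_users_gender_birthdate ON users (gender, birth_date);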
My experience is that an index on male/female is seldom going to be truly useful. And the general advice is valid. One more point to remember - indexes have to be maintained when you add or remove (or update) rows. The more indexes, the more work each modify operation has to do, slowing the system down.
There are whole books on index design.
This is a case where I would let the server statistics inform me of when to create the index. Unless you know a priori that this query is going to predominate, or that running such a query would not meet your performance goals, creating the index prematurely may just cost you performance rather than increase it. Also, you may want to think about how you would actually use the query. In this case, my guess would be that you'd typically be doing some sort of aggregation based on this column rather than simply selecting the users who meet the criteria. In that event, you'll be doing the table scan anyway and the index won't buy you anything.