MS-SQL 2012 - indexing Bit field - sql-server

On MS-SQL 2012, does it make sense to index a "Deleted" BIT field if one is always going to use it in queries (i.e. SELECT xx FROM oo WHERE Deleted = 0)?
Or does the fact that a field is BIT already come with some sort of automatic index for performance?

When you index a bit field, which holds 1, 0, or at most a few distinct values, the index narrows the search to the rows matching that value. On a small table this gains you little, but on a large table it can give a real performance gain, provided the value you search for is rare.
You can include bit columns as part of compound index
An index on a bit field can be really helpful in scenarios where there is a large discrepancy between the number of 0's and 1's, and you are searching for the smaller of the two.

Indexing a bit field will be pretty useless under most conditions, because the selectivity is so low. An index scan on a large table is not going to be better than a table scan. If there are other conditions you can use to create filtered indexes, you could consider that.
If this field is changing the nature of the logic in such a way that you will always need to consider it in the predicate, you might consider splitting the data into other tables when reporting.

Whether to index a bit field depends on several factors which have been adequately explained in the answer to this question. Link to 231125

As others have mentioned, selectivity is the key. However, if you're always searching on one value or another and that value is highly selective, consider using a filtered index.
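For instance, a filtered index along these lines stores only the undeleted rows, so it stays small and selective (sketch in SQL Server syntax; the table and column names other than Deleted are made up):

```sql
-- Only rows with Deleted = 0 are stored in this index at all.
CREATE NONCLUSTERED INDEX IX_Orders_NotDeleted
ON dbo.Orders (CustomerId)   -- hypothetical key column
WHERE Deleted = 0;
```

Note that queries must repeat the Deleted = 0 predicate for the optimizer to match the filtered index.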

Why not put it at the front of your clustered index? If deletes are incremental you'd have to turn your fill factor down, but they're probably daily, right? And you have way more deleted records than undeleted records? And, as you say, you only ever query undeleted records. So, yes. Don't just index that column. Cluster on it.
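A sketch of that suggestion (hypothetical table and key names; measure first, since re-clustering rewrites the whole table):

```sql
-- Deleted leads; the old key follows so the clustering key stays unique enough.
CREATE CLUSTERED INDEX CIX_Orders_Deleted
ON dbo.Orders (Deleted, OrderId)
WITH (FILLFACTOR = 80);  -- leave room for rows that move when Deleted flips 0 -> 1
```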

It can be useful as part of a composite index, when the bit column is in the first position in the index. But if you expect to select only one of the values (select ... where deleted=1 and another_key=?, but never deleted=0), then create an index on another_key with a filter:
create index i_another on t(another_key) where deleted=1
If the bit column would be the last one in the composite index, then its occurrence as a key column is useless. However, you can include it for better performance:
create index i_another on t(another_key) include(deleted)
Then the DB engine gets the value while reading the index and doesn't need to fetch it from the base table page.

Related

SQL Server not using proper index for query

I have a table in SQL Server with about 10 million rows and a nonclustered index, ClearingInfo_idx.
I am running a query which isn't using the ClearingInfo_idx index; the execution plan shows a clustered index scan instead.
Can anyone explain why the query optimizer chooses to scan the clustered index?
I think it suggests this index because you do an exact search on the two columns immediate and clearingOrder_clearingOrderId. Those values are numbers, which are good to search on. The column status is nvarchar, which isn't the best for a search, and because of your IN clause, SQL Server needs to search for two of those values.
SQL Server would use the two number columns to get a faster result, and search the status column in a second round, after the number of possible results has been reduced by the exact search on the two number columns.
Hopefully you get my point. :-) Otherwise, just ask again. :-)
As Luaan already pointed out, the likely reason the system prefers to scan the clustered index is because
you're asking for all fields to be returned (SELECT *); change this to the fields that are present in the index (= index fields + clustered-index fields) and you'll probably see it using just the index. If you need a couple of extra fields you can consider INCLUDE-ing those in the index.
the order of the index fields isn't very optimal. Additionally, it might well be that the 'content' of the fields isn't very helpful either. How many distinct values are present in the index columns, and how are they spread around? If your WHERE covers 90% of the records, there is very little reason to first create a (huge) list of keys and then go fetch those from the clustered index later on. Scanning the latter directly makes much more sense.
Did you try the suggested index? Not sure what other queries run on the table, but for this particular query it seems like a valid replacement to me. Whether the replacement will satisfy the other queries is another question, of course. Adding extra indexes might negatively impact your IUD operations and it will require more disk space; there is no such thing as a free lunch =)
That said, if performance is an issue, have you considered a filtered index? (again, no such thing as a free lunch; it's all about priorities)
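The INCLUDE suggestion above could look like this (sketch; the key order and the included column are assumptions, since the real query and index definition aren't shown):

```sql
-- Key columns support the seek; INCLUDE-d columns make the index covering,
-- so the plan never has to visit the clustered index at all.
CREATE NONCLUSTERED INDEX IX_ClearingInfo_Covering
ON dbo.ClearingInfo (clearingOrder_clearingOrderId, immediate, status)
INCLUDE (clearedDate);  -- hypothetical: list whatever non-key columns the SELECT returns
```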

What is the right way to index a postgres table when doing a query with two fields?

If I have a large table with:
varchar foo
integer foo_id
integer other_id
varchar other_field
And I might be doing queries like:
select * from table where other_id=x
obviously I need an index on other_id to avoid a table scan.
If I'm also doing:
select * from table where other_id=x and other_field='y'
Do I want another index on other_field or is that a waste if I never do:
select * from table where other_field='y'
i.e. I only use other_field with other_id together in a query.
Would a compound index of both [other_id, other_field] be better? Or would that cause a table scan for the 1st simple query?
Use EXPLAIN and EXPLAIN ANALYZE, if you are not using these two already. Once you understand query plan basics you'll be able to optimize database queries pretty effectively.
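For example (hypothetical table and values):

```sql
-- EXPLAIN prints the planner's chosen plan without executing the query;
-- EXPLAIN ANALYZE actually runs it and adds real timings and row counts.
EXPLAIN SELECT * FROM my_table WHERE other_id = 42;
EXPLAIN ANALYZE SELECT * FROM my_table WHERE other_id = 42 AND other_field = 'y';
```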
Now to the question - saying anything without knowing a bit about the values might be misleading. If there are not that many other_field values for any specific other_id, then a simple index other_id would be enough. If there are many other_field values (i.e. thousands), I would consider making the compound index.
Do I want another index on other_field or is that a waste if I never do:
Yes, that would very probably be a waste of space. Postgres is able to combine two indexes, but the conditions must be just right for that.
Would a compound index of both [other_id, other_field] be better?
Might be.
Or would that cause a table scan for the 1st simple query?
Postgres is able to use a multi-column index only for the first column (not exactly true - check the answer comments).
The basic rule is - get a real data set, prepare queries you are trying to optimize. Run EXPLAIN ANALYZE on those queries. Try to rewrite them (i.e. joins instead of subselects or vice versa) and check the performance (EXPLAIN ANALYZE). Try to add indexes where you feel it might help and check the performance (EXPLAIN ANALYZE)... if it does not help, don't forget to drop the unnecessary index.
And if you are still having problems and your data set is big (tens of millions+), you might need to reconsider even running specific queries. A different approach might be needed (e.g. batch / async processing) or a different technology for the specific task.
If other_id is highly selective, then you might not need an index on other_field at all. If only a few rows match other_id=x in the index, looking at each of them to see if they also match other_field=y might be fast enough to not bother with more indexes.
If it turns out that you do need to make the query faster, then you almost surely want the compound index. The stand alone index on other_field is unlikely to help.
The accepted answer is not entirely accurate - if you need all three queries mentioned in your question, then you'll actually need two indexes.
Let's see which indexes satisfy which WHERE clause in your queries:
WHERE clause                      {other_id}   {other_id, other_field}   {other_field, other_id}   {other_field}
other_id=x                        yes          yes                       no                        no
other_id=x and other_field='y'    partially    yes                       yes                       partially
other_field='y'                   no           no                        yes                       yes
So to satisfy all 3 WHERE clauses, you'll need:
either an index on {other_id} and a composite index on {other_field, other_id}
or an index on {other_field} and a composite index on {other_id, other_field}
or a composite index on {other_id, other_field} and a composite index on {other_field, other_id}.¹
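The first of those combinations, written out in Postgres syntax (hypothetical table name t):

```sql
-- Serves other_id=x alone, and participates in other_id=x AND other_field='y'.
CREATE INDEX idx_t_other_id ON t (other_id);
-- Serves other_field='y' alone, and the combined predicate on its own.
CREATE INDEX idx_t_field_id ON t (other_field, other_id);
```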
Depending on distribution of your data, you could also get away with {other_id} and {other_field}, but you should measure carefully before opting for that solution. Also, you may consider replacing * with a narrower set of fields and then covering them by indexes, but that's a whole other topic...
¹ A "fatter" solution than the other two - consider it only if you have specific covering needs.

Indexes with columns included

I noticed in my database that I have a lot of indexes in a table that only differ in the included columns.
For example for table A I have this:
INDEX ON A(COLUMN_A) INCLUDE (COLUMN_B)
INDEX ON A(COLUMN_A) INCLUDE (COLUMN_C)
INDEX ON A(COLUMN_A) INCLUDE (COLUMN_D)
It seems to me it would be more efficient (for inserts/updates/deletes) to just have this:
INDEX ON A(COLUMN_A) INCLUDE (COLUMN_B, COLUMN_C, COLUMN_D)
Would there be any reason not to do this?
Thanks!
You are right, they should be combined into one. The non-key columns that are INCLUDE-ed are just stored in the leaf nodes of the index so they can be read. Unlike key columns, they don't form part of the hierarchy of the index, so their order isn't important. Having fewer indexes is a good thing when the extra indexes aren't adding anything useful, as in this case with your redundant indexes.
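Concretely, the consolidation could look like this (SQL Server syntax; the existing index names are made up, so check yours with sp_helpindex first):

```sql
-- Drop the three overlapping indexes...
DROP INDEX IX_A_IncB ON A;
DROP INDEX IX_A_IncC ON A;
DROP INDEX IX_A_IncD ON A;
-- ...and keep a single one that covers all three query shapes.
CREATE NONCLUSTERED INDEX IX_A_ColumnA
ON A (COLUMN_A)
INCLUDE (COLUMN_B, COLUMN_C, COLUMN_D);
```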
See also
https://stackoverflow.com/a/1308012/8479
http://msdn.microsoft.com/en-us/library/ms190806.aspx
Most likely, this is a read optimization, not a write one. If there are multiple queries that use the same key column (COLUMN_A) but a different "data" column (COLUMN_B/C/D), this kind of index saves some I/O (no need to load unnecessary data from the table; you already have it along with the index). Including all the data columns in one index would take less space overall (no need to store the key column three times), but each of the indices in your case is smaller than one "combined" index.
Hopefully, the indices were created this way for a reason, based on performance profiling and real performance issues. This is where documentation with motivation comes in very handy - why were the indices created this way? Was it sloppiness? Little understanding of the way indices work? Automatic optimization based on query plan analysis? Unless you know that, it might be a bad idea to tweak this kind of thing, especially if you don't actually have a performance problem.

should nearly unique fields have indexes

I have a field in a database that is nearly unique: 98% of the time the values will be unique, but it may have a few duplicates. I won't be doing many searches on this field; say twice a month. The table currently has ~5000 records and will gain about 150 per month.
Should this field have an index?
I am using MySQL.
I think the 'nearly unique' is probably a red herring. The data is either unique, or it's not, but that doesn't determine whether you would want to index it for performance reasons.
Answer:
5000 records is really not many at all, and regardless of whether you have an index, searches will still be fast. At that rate of inserts, it'll take you 3 years to get to 10000 records, which is still also not many.
I personally wouldn't bother with adding an index, but it wouldn't matter if you did.
Explanation:
What you have to think about when deciding to add an index is the trade-off between insertion speed, and selection speed.
Without an index, doing a select on that field means MySQL has to walk over every single row and read every single field. Adding an index prevents this.
The downside of the index is that each time data gets inserted, the DB has to update the index in addition to adding the data. This is usually a small overhead, but you'd really notice it if you had loads of indexes, and were doing a lot of writes.
Once your table grows large enough, you'd want an index anyway, as otherwise your selects would take all day; it's just something to be aware of so that you don't end up adding indexes on fields "just in case I need it".
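If you do decide to add it, it's a one-liner (MySQL syntax, hypothetical names):

```sql
-- A plain secondary index; use UNIQUE only if duplicates are truly impossible.
ALTER TABLE my_table ADD INDEX idx_almost_unique (almost_unique_col);
```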
That's not very many records at all; I wouldn't bother making any indexes on that table. The relative uniqueness of the field is irrelevant - even on years-old commodity hardware I'd expect a query on that table to take a fraction of a second.
you can use the general rule of thumb: optimize when it becomes a problem. Just don't use an index until you notice you need one.
From what you say, it doesn't sound like an index is necessary. Rule of thumb is index fields that are being used in SELECTS a lot to speed up the searching, which in turn (can) slows down INSERTS and UPDATES.
On a recordset as small as yours, I don't think you will see much of a real world hit either way.
If you'll only be doing searches on it twice a month and it's that few rows, then I would say don't index it. It's all but useless.
No. There aren't many records and it's not going to be frequently queried. No need to index.
It's really a judgement call. With such a small table you can search reasonably quickly without an index, so you could get by without it.
On the other hand, the cost of creating an index you don't really need is pretty low, so you're not saving yourself much by not doing it.
Also, if you do create the index, you're covered for the future if you suddenly start getting 1000 new records/week. Possibly you know enough about the situation to say for certain that that will never happen, but requirements do have a way of changing when you least expect.
EDIT: As far as changing requirements, the thing to consider is this: If the DB does grow and you find out later that you do need an index, can you simply create the index and be done? Or will you also need to change lots of code to make use of the new index?
It depends. As others have responded, there's a trade off between table update speed and selection speed. Table update includes inserts, updates, and deletes on the table.
One question you didn't address: does the table have a primary key, and a corresponding index? A table with no indexes usually benefits from having at least one index. The most common way of getting that index is to declare a primary key and rely on the DBMS to generate an index accordingly.
If a table has no candidates for a primary key, that usually indicates a serious flaw in the table design. That's a separate issue and should get a separate discussion.

Can Multiple Indexes Work Together?

Suppose I have a database table with two fields, "foo" and "bar". Neither of them are unique, but each of them are indexed. However, rather than being indexed together, they each have a separate index.
Now suppose I perform a query such as SELECT * FROM sometable WHERE foo='hello' AND bar='world'; My table has a huge number of rows for which foo is 'hello' and a small number of rows for which bar is 'world'.
So the most efficient thing for the database server to do under the hood is use the bar index to find all rows where bar is 'world', then return only those rows for which foo is 'hello'. This is O(n), where n is the number of rows where bar is 'world'.
However, I imagine it's possible that the process would happen in reverse, where the foo index was used and the results searched. This would be O(m), where m is the number of rows where foo is 'hello'.
So is Oracle smart enough to search efficiently here? What about other databases? Or is there some way I can tell it in my query to search in the proper order? Perhaps by putting bar='world' first in the WHERE clause?
Oracle will almost certainly use the most selective index to drive the query, and you can check that with the explain plan.
Furthermore, Oracle can combine the use of both indexes in a couple of ways -- it can convert the btree indexes to bitmaps and perform a bitmap AND operation on them, or it can perform a hash join on the rowids returned by the two indexes.
One important consideration here might be any correlation between the values being queried. If foo='hello' accounts for 80% of the rows in the table and bar='world' accounts for 10%, then Oracle is going to estimate that the query will return 0.8*0.1 = 8% of the table's rows. However, this may not be correct - the query may actually return 10% of the rows, or even 0% of the rows, depending on how correlated the values are. Now, depending on the distribution of those rows throughout the table, it may not be efficient to use an index to find them. You may still need to access (say) 70% of the table blocks to retrieve the required rows (google for "clustering factor"), in which case Oracle is going to perform a full table scan if it gets the estimation correct.
In 11g you can collect multicolumn statistics to help with this situation I believe. In 9i and 10g you can use dynamic sampling to get a very good estimation of the number of rows to be retrieved.
To get the execution plan do this:
explain plan for
SELECT *
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Contrast that with:
explain plan for
SELECT /*+ dynamic_sampling(4) */
*
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Eli,
In a comment you wrote:
Unfortunately, I have a table with lots of columns each with their own index. Users can query any combination of fields, so I can't efficiently create indexes on each field combination. But if I did only have two fields needing indexes, I'd completely agree with your suggestion to use two indexes. – Eli Courtwright (Sep 29 at 15:51)
This is actually rather crucial information. Sometimes programmers outsmart themselves when asking questions. They try to distill the question down to the seminal points but quite often over simplify and miss getting the best answer.
This scenario is precisely why bitmap indexes were invented -- to handle the times when unknown groups of columns would be used in a where clause.
Just in case someone says that BMIs (bitmap indexes) are for low-cardinality columns only and may not apply to your case: low is probably not as small as you think. The only real issue is concurrency of DML against the table - it must be single-threaded or rare for this to work.
Yes, you can give Oracle "hints" in the query. These hints are disguised as comments (/*+ HINT */) to the database and are mainly vendor specific. So a hint for one database will not work on another database.
I would use index hints here, the first hint for the small table. See here.
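An Oracle index hint would look something like this (sketch; the index name bar_idx is an assumption):

```sql
-- The /*+ ... */ comment asks the optimizer to drive the query from bar's index.
SELECT /*+ INDEX(t bar_idx) */ *
FROM sometable t
WHERE foo = 'hello' AND bar = 'world';
```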
On the other hand, if you often search over these two fields, why not create an index on these two? I do not have the right syntax, but it would be something like
CREATE INDEX IX_BAR_AND_FOO on sometable(bar,foo);
This way data retrieval should be pretty fast. And in case the combination is unique, then you simply create a unique index, which should be lightning fast.
First off, I'll assume that you are talking about nice, normal, standard b*-tree indexes. The answer for bitmap indexes is radically different. And there are lots of options for various types of indexes in Oracle that may or may not change the answer.
At a minimum, if the optimizer is able to determine the selectivity of a particular condition, it will use the more selective index (i.e. the index on bar). But if you have skewed data (there are N values in the column bar but the selectivity of any particular value is substantially more or less than 1/N of the data), you would need to have a histogram on the column in order to tell the optimizer which values are more or less likely. And if you are using bind variables (as all good OLTP developers should), depending on the Oracle version, you may have issues with bind variable peeking.
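Gathering a histogram on the skewed column is a DBMS_STATS call along these lines (sketch; the schema and table names are assumed):

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => user,
    tabname    => 'SOMETABLE',
    method_opt => 'FOR COLUMNS BAR SIZE 254'  -- up to 254 histogram buckets on BAR
  );
END;
/
```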
Potentially, Oracle could even do an on the fly conversion of the two b*-tree indexes to bitmaps and combine the bitmaps in order to use both indexes to find the rows it needs to retrieve. But this is a rather unusual query plan, particularly if there are only two columns where one column is highly selective.
So is Oracle smart enough to search efficiently here?
The simple answer is "probably". There are lots of very bright people at each of the database vendors working on optimizing the query optimizer, so it's probably doing things that you haven't even thought of. And if you keep the statistics up to date, it'll probably do even more.
I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.
The best approach would be to add foo to bar's index, or add bar to foo's index (or both). If foo's index also contains an index on bar, that additional indexing level will not affect the utility of the foo index in any current uses of that index, nor will it appreciably affect the performance of maintaining that index, but it will give the database additional information to work with in optimizing queries such as in the example.
It's better than that.
Index seeks are usually much quicker than full table scans. So behind the scenes, Oracle (and SQL Server, for that matter) will first locate the range of rows on both indexes. It will then look at which range is shorter, and iterate over the shorter range to find the matches in the larger of the two.
You can provide hints as to which index to use. I'm not familiar with Oracle, but in Mysql you can use USE|IGNORE|FORCE_INDEX (see here for more details). For best performance though you should use a combined index.
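In MySQL the hint goes right after the table name (sketch; the index name is an assumption):

```sql
-- USE INDEX suggests, FORCE INDEX insists, IGNORE INDEX forbids.
SELECT * FROM sometable USE INDEX (bar_idx)
WHERE foo = 'hello' AND bar = 'world';
```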