I think it will help to explain where I'm coming from if I describe how I understand B-tree indexes at a fundamental level. I'm not a DBA; I'm asking this question as a layman with a basic understanding of data structures.
The basic idea of an index is that it speeds up searches by letting the database skip a significant number of records.
AFAIK, the binary tree data structure, which I presume B-tree indexes are based on, helps us search without scanning the entire database by dividing the data into nodes. As an oversimplified example, words starting with A to M are stored in the left node and words starting with N to Z in the right node at the first level of the tree. In this case, when we search for the word "Jackfruit", it only searches the left node, skipping the right node and saving a significant amount of time and I/O.
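To make my mental model concrete, here is a rough Python sketch (not how a real database implements it, just the halve-the-search-space idea that a tree-based index exploits):

# A sorted list of words stands in for the tree; each step discards half
# of the remaining entries instead of scanning all of them.
def find(sorted_words, target):
    lo, hi = 0, len(sorted_words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2              # look at the "middle" node
        if sorted_words[mid] == target:
            return mid                    # found the word
        elif sorted_words[mid] < target:
            lo = mid + 1                  # target is in the right half; skip the left
        else:
            hi = mid - 1                  # target is in the left half; skip the right
    return None                           # not present

words = ["apple", "banana", "grape", "jackfruit", "mango", "orange", "zucchini"]
print(find(words, "jackfruit"))           # 3, found without touching most entries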
In this sense, how does a bitmap index let us avoid scanning the entire database when searching? If it doesn't, how does it speed up searches? Or is it just meant for compression?
(Image: a conceptual illustration of a bitmap index, with one column of bits per distinct value.)
Using that structure, how does a DB find rows? Does it scan all of them? With a binary tree, the fact that you don't have to scan everything is exactly what speeds up the search. I can't find any explanation of how a DB gains an advantage when searching for rows with a bitmap, other than the fact that the bitmap takes less space.
B-tree indexes are good for key searches (duplicates are allowed, but the column mostly holds distinct values, e.g. SSN). Bitmap indexes are better when a column has only a few distinct values, like 'sex', 'state', 'color', and so on.
Oracle bitmap indexes are very different from standard B-tree indexes. A bitmap structure is conceptually a two-dimensional array: one bitmap for each distinct value in the indexed column, with one bit for every row in the table. In other words, the structure covers each distinct value in the index multiplied by the number of rows in the table.
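As a rough illustration (plain Python with invented column names and uncompressed bitmaps; real engines compress the bitmaps and work on disk blocks), here is how combining bitmaps lets the engine find matching rows without touching the rest:

# Hypothetical table: one bit per row in each value's bitmap.
rows = [
    {"name": "Alice", "sex": "F", "state": "NY"},
    {"name": "Bob",   "sex": "M", "state": "CA"},
    {"name": "Carol", "sex": "F", "state": "CA"},
    {"name": "Dave",  "sex": "M", "state": "NY"},
]

def build_bitmaps(rows, column):
    # One bitmap per distinct value: bit i is set if row i holds that value.
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

sex_idx = build_bitmaps(rows, "sex")      # {"F": 0b0101, "M": 0b1010}
state_idx = build_bitmaps(rows, "state")  # {"NY": 0b1001, "CA": 0b0110}

# WHERE sex = 'F' AND state = 'CA': AND the two bitmaps together,
# then fetch only the rows whose bit is set.
match = sex_idx["F"] & state_idx["CA"]
print([rows[i] for i in range(len(rows)) if match & (1 << i)])   # Carol only

The AND operates on very compact bit data (64 rows per machine word), so the engine narrows down to the matching rows without reading any of the non-matching table rows.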
Please see http://www.dba-oracle.com/oracle_tips_bitmapped_indexes.htm for a more detailed explanation.
Related
On MS-SQL 2012, does it make sense to index a "Deleted" BIT field if it will always be used in queries (i.e. SELECT xx FROM oo WHERE Deleted = 0)?
Or does a BIT field already come with some sort of automatic index for performance?
When you index a bit field, which holds only 1, 0, or some other very limited set of values, the index narrows the search to the rows matching that value. On a small table this makes little difference, but on a large amount of data it may give you a performance gain.
You can also include bit columns as part of a compound index.
An index on a bit field can be really helpful when there is a large discrepancy between the number of 0s and 1s and you are searching for the smaller of the two.
Indexing a bit field will be pretty useless under most conditions, because the selectivity is so low. An index scan on a large table is not going to be better than a table scan. If there are other conditions you can use to create filtered indexes, you could consider that.
If this field changes the nature of the logic in such a way that you will always need to consider it in the predicate, you might consider splitting the data into other tables for reporting.
Whether to index a bit field depends on several factors which have been adequately explained in the answer to this question. Link to 231125
As others have mentioned, selectivity is the key. However, if you're always searching on one value or another and that value is highly selective, consider using a filtered index.
Why not put it at the front of your clustered index? If deletes are incremental, you'd have to turn your fill factor down, but they're probably daily, right? And you have way more deleted records than undeleted records? And, as you say, you only ever query undeleted records. So, yes: don't just index that column. Cluster on it.
It can be useful as part of a composite index when the bit column is in the first position. But if you plan to use it only for selecting one value (SELECT ... WHERE deleted = 1 AND another_key = ?, but never deleted = 0), then create an index on another_key with a filter:
create index i_another on t(another_key) where deleted=1
If the bit column would be the last column in the composite index, then its presence in the index key is useless. However, you can INCLUDE it for better performance:
create index i_another on t(another_key) include(deleted)
Then the DB engine gets the value while reading the index and doesn't need to pick it up from the base table page.
I was debating between using BTREE index or HASH index.
Theoretically, what are the advantages of using HASH indexes?
When should they be chosen and more importantly, why?
I have read that hash indexes are good for point queries, but WHY?
I already know that BTREE indexes are best for range queries because you can easily traverse through the leaf nodes by going from left to right.
You don't mention a specific DBMS so this answer is pretty generic.
A properly performing hash index should reach the answer to a point query in a single fetch. A B-tree will need something like log_B(n) secondary-storage accesses, where B is the approximate branching factor and n is the number of entries. Caching and reasonable node sizes will likely keep that to a couple of fetches, but that is still twice what the hash index needs. In addition, each B-tree node access involves non-trivial computation to traverse the sub-index within the node (something like log_2(B) data comparisons per node). The computation for a hash index is usually very limited: a hash computation and a small number of data comparisons, hopefully one. The per-node search computation is often significant for B-tree based indexes.
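As a toy illustration (in-memory Python stand-ins, not real on-disk index structures), the point-query difference looks like this:

import bisect

keys = list(range(1_000_000))
hash_index = {k: f"row{k}" for k in keys}   # hash index: key -> row pointer
sorted_keys = keys                          # sorted array standing in for a B-tree

# Hash index: one hash computation, one probe.
row = hash_index[123_456]

# Ordered structure: ~log2(n) comparisons (about 20 here), which in a real
# B-tree means walking a few nodes from the root down to a leaf.
pos = bisect.bisect_left(sorted_keys, 123_456)
assert sorted_keys[pos] == 123_456

The hash lookup gives no help at all for a range predicate such as BETWEEN, whereas the ordered structure can find the lower bound and then read neighbouring keys in order.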
In terms of picking, use a hash index if
you only expect point queries
you don't expect the data to fall into any poorly performing cases for the system hash function (oddball case but thought I should mention it)
The B-tree family is better if you have any kind of range query and/or want sorted results on a pre-determinable set of columns.
I'm new to databases and have been reading that adding an index to a field you need to search over can dramatically speed up search times. I understand this in practice, but am curious how it actually works. I've searched a bit on the subject, but haven't found a good, concise, not overly technical answer.
I've read the analogy of it being like an index at the back of a book, but in the case of a data field of unique elements (such as e-mail addresses in a user database), the back-of-the-book analogy would seem to give the same linear look-up time as a non-indexed search.
What is going on here to speed up search times so much? I've read a little bit about searching using B+-trees, but the descriptions were a bit too in-depth. What I'm looking for is a high-level overview, something to help my conceptual understanding, not technical details.
Expanding on the search algorithm efficiencies, a key area in database performance is how fast the data can be accessed.
In general, reading data from disk is much, much slower than reading data from memory.
To illustrate the point, let's assume everything is stored on disk. If you need to search through every row of data in a table looking for certain values in a field, you still need to read each entire row of data from the disk to see if it matches; this is commonly referred to as a 'table scan'.
If your table is 100MB, that's 100MB you need to read from disk.
If you now index the column you want to search on, in simplistic terms the index will store each unique value of the data and a reference to the exact location of the corresponding full row of data. This index may now only be 10MB compared to 100MB for the entire table.
Reading 10MB of data from the disk (and maybe a bit extra to read the full row data for each match) is roughly 10 times faster than reading the 100MB.
Different databases will store indexes or data in memory in different ways to make these things much faster. However, if your data set is large and doesn't fit in memory then the disk speed can have a huge impact and indexing can show huge gains.
Even when everything fits in memory, there can still be large performance gains (amongst other efficiencies).
In general, that's why you may not notice any tangible difference with indexing a small dataset which easily fits in memory.
The underlying details will vary between systems and actually will be a lot more complicated, but I've always found the disk reads vs. memory reads an easily understandable way of explaining this.
Okay, after a bit of research and discussion, here is what I have learned:
Conceptually, an index is a sorted copy of the data field it is indexing, where each index value points to its original (unsorted) row. Because the database knows the values are sorted, it can apply more sophisticated search algorithms than just looking for the value from start to finish. The binary search algorithm is a simple example of a search algorithm for sorted lists; it reduces the maximum search time from O(n) to O(log n).
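Here is a minimal Python sketch of that idea, with an invented email column (a real index is a disk-based structure, not a Python list):

import bisect

# Hypothetical unsorted table: (email, rest of the row).
table = [
    ("zoe@example.com",   "row 0 data"),
    ("alice@example.com", "row 1 data"),
    ("mike@example.com",  "row 2 data"),
    ("bob@example.com",   "row 3 data"),
]

# The "index": a sorted copy of the indexed column, each entry pointing
# back to the position of its original (unsorted) row.
index = sorted((email, row_id) for row_id, (email, _) in enumerate(table))
index_keys = [email for email, _ in index]

def lookup(email):
    # Binary search the sorted index, then follow the pointer to the row.
    pos = bisect.bisect_left(index_keys, email)
    if pos < len(index_keys) and index_keys[pos] == email:
        return table[index[pos][1]]
    return None

print(lookup("mike@example.com"))   # ('mike@example.com', 'row 2 data')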
As a side note: a decent sorting algorithm generally takes O(n log n) to complete, which means (as we've all probably heard before) you should only put indexes on fields you will search often, since it's a bit more expensive to build the index (which includes a sort) than it is to do a full search a few times. For example, in a large database of >1,000,000 entries it's in the range of 20x more expensive to sort than to search once.
Edit:
See @Jarod Elliott's answer for a more in-depth look at search efficiencies, specifically with regard to disk-read operations.
To continue your back-of-the-book analogy, if the pages were in order by that element it would be the same look-up time as a non-indexed search, yes.
However, what if your book were a list of book reviews ordered by author, but you only knew the ISBN. The ISBN is unique, yes, but you'd still have to scan each review to find the one you are looking for.
Now, add an index at the back of the book, sorted by ISBN. Boom, fast search time. This is analogous to the database index, going from the index key (ISBN) to the actual data row (in this case a page number of your book).
Can anyone explain why databases tend to use B-tree indexes rather than a linked list of ordered elements?
My thinking is this: in a B+ tree (used by most databases), the non-leaf nodes are collections of pointers to other nodes. Each collection (node) is an ordered list. The leaf nodes, which are where all the data pointers live, form a linked list of clusters of data pointers.
The non-leaf nodes are only used to find the correct leaf node in which the target data pointer lives. So, since the leaf nodes are essentially a linked list, why not do away with the tree part and just keep the linked list? Metadata could be provided giving the minimum and maximum value of each leaf-node cluster, so the application could just read the metadata and find the correct leaf where the data pointer lives.
Just to be clear: the most efficient algorithm for searching a randomly accessible ordered list is a binary search, which is O(log n), the same as a B-tree. The benefit of using a linked list rather than a tree is that it doesn't need to be balanced.
Is this structure feasible?
After some research and paper reading I found the answer.
In order to cope with large amounts of data, such as millions of records, indexes have to be organised into clusters. A cluster is a contiguous group of sectors on a disk that can be read into memory quickly. These are usually about 4096 bytes long.
Each of these clusters can contain a bunch of index entries, which can point to other clusters or to data on disk. So if we had a linked-list index, each element of the list would be the collection of index entries contained in a single cluster (say 100).
So, when we are looking for a specific record, how do we know which cluster it is in? We perform a binary search to find the cluster in question [O(log n)].
However, to do a binary search we need to know the range of values in each cluster, so we need meta-data giving the min and max value of each cluster and where that cluster is. This is great, except that if each cluster can hold 100 entries, and our meta-data is also held in a single cluster (for speed), then our meta-data can only point to 100 clusters.
What happens if we want more than 100 clusters? We have to add more meta-data clusters, each pointing to 100 data clusters (10,000 records each); with 100 of them we could reach 1,000,000 records. But how do we know which of those meta-data clusters to query in order to find our target data cluster? We could search them one after another, but that doesn't scale. So I add a meta-meta-data cluster that indicates which meta-data cluster I should query to find the target data cluster. Now I have a tree!
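A quick sanity check of the arithmetic, assuming (as above, a made-up but typical order of magnitude) that about 100 entries fit in one cluster:

entries_per_cluster = 100

def levels_needed(record_count):
    # Each extra level of clusters multiplies the number of reachable
    # records by the fan-out, so count levels until everything is covered.
    levels, reachable = 1, entries_per_cluster
    while reachable < record_count:
        levels += 1
        reachable *= entries_per_cluster
    return levels

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:,} records -> {levels_needed(n)} level(s) of clusters per lookup")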
So that's why databases use trees. It's not just the speed; it's the size of the indexes and the need to have indexes referencing other indexes. What I have described above is a B+ tree: internal nodes contain references to other internal nodes or to leaf nodes, and leaf nodes contain references to data on disk.
Phew!
I guess I answered that question in Chapter 1 of my SQL Indexing Tutorial: http://use-the-index-luke.com/sql/anatomy
To summarize the most important parts, with respect to your particular question:
-- from "The Leaf Nodes"
The primary purpose of an index is to provide an ordered
representation of the indexed data. It is, however, not possible to
store the data sequentially because an insert statement would need to
move the following entries to make room for the new one. But moving
large amounts of data is very time-consuming, so that the insert
statement would be very slow. The problem's solution is to establish a
logical order that is independent of physical order in memory.
-- from "The B-Tree":
The index leaf nodes are stored in an arbitrary order—the position on
the disk does not correspond to the logical position according to the
index order. It is like a telephone directory with shuffled pages. If
you search for “Smith” but open it at “Robinson” in the first
place, it is by no means granted that Smith comes farther back.
Databases need a second structure to quickly find the entry among the
shuffled pages: a balanced search tree—in short: B-Tree.
Linked lists are usually not ordered by key value, but by the moment of insertion: insertion is done at the end of the list, and each new entry contains a pointer to the previous entry of the list.
They are usually implemented as heap structures.
This has 2 main benefits:
they are very easy to manage (you just need a pointer for each element)
if used in combination with an index you can overcome the problem of sequential access.
If instead you use a list ordered by key value, you gain ease of access (binary search), but you run into problems every time you edit, delete, or insert an element: you must in fact keep the list ordered after each operation, which makes the algorithms more complex and time-consuming.
B+ trees are better structures, having all the properties you stated, and other advantages:
you can make group searches (over intervals of key values) at roughly the cost of a single search, since the elements in the leaves end up automatically ordered thanks to the insertion algorithm; this is not possible with a linked list because it would require many linear searches over the list (see the sketch after this list).
the cost is logarithmic in the number of elements, and since these structures are kept balanced, the cost of access does not depend on the particular value you are looking for (very useful).
these structures are very efficient for update, insert, and delete operations.
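Here is a rough Python sketch of the range-search point, using a sorted array with binary search as a stand-in for the ordered leaf level of a B+ tree (made-up keys):

import bisect

sorted_keys = list(range(0, 10_000, 7))   # keys kept in sorted order, like B+ tree leaves

def range_query(lo, hi):
    start = bisect.bisect_left(sorted_keys, lo)    # one logarithmic search for the lower bound
    end = bisect.bisect_right(sorted_keys, hi)     # then just read neighbours in order
    return sorted_keys[start:end]

print(range_query(100, 140))   # [105, 112, 119, 126, 133, 140]

With an insertion-ordered linked list there is no such shortcut: every element would have to be visited to answer the same query.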
Is it a good idea to create an index on a field which is VARCHAR(500)? I am going to search on it a lot, but I am not sure whether creating an index on such a 'big' field is a good idea.
What do you think?
It is usually not a good idea, since the index files will be huge and the search relatively slow. It is better to index a prefix of the field, such as its first 32 or 64 characters. Another possibility, if it makes sense, is to use a full-text index.
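A rough sketch of the prefix idea (plain Python with made-up data, just to show why a short prefix is usually enough): index only the first few characters, then verify the full value on the handful of candidate rows.

PREFIX_LEN = 8   # stand-in for "the first 32 or 64 characters"

# Hypothetical table: (long varchar value, row id).
table = [
    ("aardvark stories and other long essays...", 0),
    ("zebra crossings considered harmful...",     1),
    ("aardvark stories and other tall tales...",  2),
]

# The index maps each short prefix to the rows sharing it (much smaller keys).
prefix_index = {}
for value, row_id in table:
    prefix_index.setdefault(value[:PREFIX_LEN], []).append(row_id)

def find(full_value):
    candidates = prefix_index.get(full_value[:PREFIX_LEN], [])
    # The prefix narrows the search; only the candidates are compared in full.
    return [rid for rid in candidates if table[rid][0] == full_value]

print(find("zebra crossings considered harmful..."))   # [1]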
In general it's a good idea to create indexes on fields that you'll use for search. But, depending on the use, there are better options:
Full text search (from wikipedia): In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
Partial index (again, from wikipedia): In databases, a partial index, also known as a filtered index, is an index which has some condition applied to it so that it includes only a subset of the rows in the table.
Maybe you should consider giving more information on the use that index will have.
You should put indexes where they make frequently used queries run faster; however, there are a number of issues to consider:
Index keys have a limited size, e.g. MS SQL has a 900-byte limit
Many indexes may incur an overhead while writing (although it was minimal the last time I benchmarked inserting a million entries into a table with 9 indexes)
Many indexes take up precious space in the DB
Many indexes may create deadlocks when inserting data
Also take a look at the documentation for the database you use. Most databases have support for efficient searching in text columns.