Hash tables seem preferable in terms of disk access. What is the real reason that indexes are usually implemented with a tree?
Sorry if this is a basic question, but I did not find a straight answer on SO.
One of the most common actions on data is to sort it or to search for it within a range. A tree keeps the data in order, while a hash table is only useful for looking up a single row and has no idea what the next row is.
So, as this answer notes, hash tables are no good for this common case:
SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000
or
SELECT * FROM MyTable ORDER BY x
Obviously there are cases where hash tables are better, but it is best to deal with the main cases first.
Size: B-trees start small and perfectly formed and grow gracefully to enormous sizes. Hash tables have a fixed size, which can be too big (10,000 buckets for 1,000 entries) or too small (10,000 buckets for 1,000,000,000 entries) for the amount of data you have.
Hash tables provide no benefit for this case:
SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000
One only has to look at the hash index implementation associated with MySQL's MEMORY storage engine to see its disadvantages:
They can be used with equality operators such as = but not with comparison operators such as <
The optimizer cannot use a hash index to speed up ORDER BY operations.
Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
The optimizer cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use).
And note that the above applies to hash indexes implemented in memory, without even considering the disk access issues associated with indexes implemented on disk.
Disk access factors, as noted by #silentbicycle, would skew things in favour of the balanced-tree index even more.
Databases typically use B+ trees (a specific kind of tree), since they have better disk access properties - each node can be made the size of a filesystem block. Doing as few disk reads as possible has a greater impact on speed, since comparatively little time is spent on either chasing pointers in a tree or hashing.
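As a rough illustration of why block-sized nodes keep lookups cheap, here is a back-of-the-envelope sketch; the block size, slot size, and row count below are assumptions, not figures from any particular engine.

    // Estimate the height of a B+ tree whose nodes each fill one filesystem block.
    public class BPlusTreeHeight {
        public static void main(String[] args) {
            long entries = 1_000_000_000L;    // rows to index (assumed)
            int nodeSize = 4096;              // one 4 KiB block per node (assumed)
            int slotSize = 16;                // bytes per key + child pointer (assumed)
            int fanout = nodeSize / slotSize; // ~256 children per internal node

            // Height ~= number of node reads needed to walk from root to leaf.
            int height = (int) Math.ceil(Math.log(entries) / Math.log(fanout));
            System.out.printf("fanout=%d, height=%d%n", fanout, height);
            // With these numbers, a billion keys fit in a tree of height 4 or so,
            // i.e. only a handful of block reads per lookup.
        }
    }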
Hashing is good when the data is not growing; more technically, when N/n is constant,
where N = number of elements and n = number of hash slots.
If this is not the case, hashing doesn't give a good performance gain.
In a database the data will most probably be growing at a significant pace, so using a hash there is not a good idea.
And yes, sorting matters too.
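To make the N/n point concrete, here is a tiny sketch (the slot count and element counts are arbitrary assumptions): with a fixed number of slots, the average chain length, and therefore the lookup cost, grows linearly with the number of stored elements.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Fixed bucket count n, growing element count N: chains get longer and longer.
    public class FixedSlotHashing {
        public static void main(String[] args) {
            int n = 1024;                       // fixed number of hash slots (assumed)
            Random rnd = new Random(1);
            for (int N : new int[]{1_000, 100_000, 1_000_000}) {
                List<List<Integer>> buckets = new ArrayList<>();
                for (int i = 0; i < n; i++) buckets.add(new ArrayList<>());
                for (int i = 0; i < N; i++) {
                    int key = rnd.nextInt();
                    buckets.get(Math.floorMod(key, n)).add(key);
                }
                double avg = buckets.stream().mapToInt(List::size).average().orElse(0);
                int max = buckets.stream().mapToInt(List::size).max().orElse(0);
                System.out.printf("N=%,d  avg chain=%.1f  longest chain=%d%n", N, avg, max);
            }
        }
    }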
"In database most probably the data would be increasing a significant pace so using hash there is not a good idea."
That is an exaggeration of the problem. Yes, hash spaces must be fixed in size (modulo solutions à la extensible hashing), yes, their size must be managed, and yes, someone must do that job.
That said, the performance gains, if you exploit hash-based physical location to its fullest potential, are enormous.
Related
I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?
Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer up O(logN) access and (gasp) O(N) insertion, although merging 2 sorted arrays is the same complexity as merging 2 hash tables.
The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).
So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?
Well, I will speculate here (should probably be a comment - but it's going to be too long).
A HashMap is in general a fast look-up structure with O(1) search time - meaning constant. But that is the average case; (at least in Java) a HashMap converts large buckets to TreeNodes, so search inside such a bucket is O(log n). And even if we treat its search complexity as O(1), that does not mean it takes the same time as another structure's O(1); it just means the cost is constant for each separate data structure.
Memory: indeed, I will give an example here. In short, storing 15_000_000 entries would require a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives instead of objects.
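For context, a hedged back-of-the-envelope estimate of where that gigabyte goes; the per-object overheads below are rough assumptions for a typical 64-bit JVM, not measured values.

    // Rough memory estimate for 15 million HashMap entries vs a plain sorted array.
    public class HashMapFootprint {
        public static void main(String[] args) {
            long entries = 15_000_000L;
            long nodeBytes = 32;  // HashMap.Node: header + hash + key/value/next refs (assumed)
            long boxedKey  = 16;  // e.g. a boxed Integer key (assumed)
            long tableSlot = 4;   // bucket-array reference per entry, amortized (assumed)
            long perEntry = nodeBytes + boxedKey + tableSlot;

            System.out.printf("~%d bytes/entry -> ~%.2f GB before counting the values%n",
                              perEntry, entries * perEntry / 1e9);
            // A sorted int[] of the same keys costs just 4 bytes per entry.
            System.out.printf("sorted int[]: ~%.2f GB%n", entries * 4 / 1e9);
        }
    }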
Putting entries into a HashMap (usually, once it has to grow) requires all the keys to be re-hashed, which can be a significant performance hit, since they all potentially have to move to different locations.
Probably one extra point here - searches over ranges would probably require some TreeMap, whereas sorted arrays are much better suited for that. I'm thinking about partitioning an index (maybe they do it internally).
I have the same idea as you - arrays are usually contiguous memory, probably much easier for the CPU to prefetch.
And the last point: put me in their shoes, I would have started with a HashMap first... I am sure there are compelling reasons for their decision. I wonder if they have actual tests that prove this choice.
I was thinking of the reasoning behind it. Just thought of one use-case that was important in the context of text search. I could be totally wrong :)
Why sorted array and not Dictionary?
Yes, a sorted array performs well on range queries, but IMO Lucene was mainly built for text searches. Now imagine you were to run a prefix-based query, e.g. country:Ind*: with a HashMap/Dictionary you would need to scan the whole structure, whereas with a sorted array locating the matching range takes log(n).
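A minimal sketch of that prefix lookup over a sorted term dictionary (the terms and the lowerBound helper are made up for illustration): binary search finds the first candidate in log(n) steps, and every match is contiguous from there.

    // Prefix query against a sorted array of terms, e.g. country:Ind*.
    public class PrefixLookup {
        // Index of the first element >= key (classic lower-bound binary search).
        static int lowerBound(String[] sorted, String key) {
            int lo = 0, hi = sorted.length;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (sorted[mid].compareTo(key) < 0) lo = mid + 1;
                else hi = mid;
            }
            return lo;
        }

        public static void main(String[] args) {
            String[] terms = {"inca", "india", "indigo", "indonesia", "iran"}; // already sorted
            String prefix = "ind";
            for (int i = lowerBound(terms, prefix);
                 i < terms.length && terms[i].startsWith(prefix); i++) {
                System.out.println(terms[i]);  // india, indigo, indonesia
            }
        }
    }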
Since we have a sorted array, it would be inefficient to update it in place. Hence, in Lucene, segments (the inverted index resides in segments) are immutable.
A traditional B-tree implementation has O(n) space complexity [1].
So assume in a database (regardless of implementation, just consider the general case) I have a table with 10 GB of data and the index size is currently 1 GB. Can I assume that if the database grows to 100 GB, my index size will be 10 GB?
You cannot say anything "regardless of implementation."
If the index is a pure B-tree, then it theoretically should be linear in the number and size of keys being indexed, with some fudge factor for fill rates. However, it is unlikely to be a pure B-tree. First, it might be a B+tree or some other variant. A B+tree would add a very small logarithmic term to the size computation. That increase is unlikely to be material.
More importantly, most implementations do not implement theoretical B-tree operations to maintain the fill rate. For example, deletion might be implemented by merely leaving an open slot to be used by a later insert. Over a large number of operations and with a bit of bad luck, the efficiency of the index representation can degrade, so the index might get larger. If your index on 10GB is tightly packed and your 100GB is after a year of operations, it might be larger than you expect.
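To make the fill-rate point concrete, here is an illustrative calculation; the bytes-per-entry figure and the fill factors are pure assumptions, not measurements from any engine.

    // Same key count, different leaf fill factors -> noticeably different index sizes.
    public class IndexSizeEstimate {
        static double sizeGB(long rows, int bytesPerEntry, double fillFactor) {
            return rows * bytesPerEntry / fillFactor / 1e9;
        }

        public static void main(String[] args) {
            int bytesPerEntry = 20;  // key + row pointer (assumed)
            System.out.printf("10^9 rows, freshly built    (~90%% full leaves): %.1f GB%n",
                              sizeGB(1_000_000_000L, bytesPerEntry, 0.90));
            System.out.printf("10^9 rows after heavy churn (~60%% full leaves): %.1f GB%n",
                              sizeGB(1_000_000_000L, bytesPerEntry, 0.60));
            // Growth stays roughly linear in the number of keys, but the constant
            // factor drifts with the fill rate.
        }
    }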
In direct answer to your question - no, I do not think your assumption is safe, more because the index might change size over time than due to non-linearity in the underlying data structure.
I was debating between using BTREE index or HASH index.
Theoretically, what are the advantages of using HASH indexes?
When should they be chosen and more importantly, why?
I have read that hash indexes are good for point queries, but WHY?
I already know that BTREE indexes are best for range queries because you can easily traverse through the leaf nodes by going from left to right.
You don't mention a specific DBMS so this answer is pretty generic.
A properly performing hash index should reach the answer to a point query in a single fetch. A B-Tree will use something like lg_B(n) secondary storage accesses, where B is the approximate branch factor and n is the number of entries. Caching and reasonable node sizes will likely keep that to a couple of fetches, but that is still twice what the hash index needs. In addition, each B-Tree access has non-trivial computation associated with it in order to traverse the sub-index in each node (something like lg_2(B) data comparison operations per node). The computation time for a hash index is usually very limited (a hash computation and a small number of data comparison operations - hopefully one). The computation time for searching within each node is often significant for B-Tree based indices.
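A hedged back-of-the-envelope comparison of the two access patterns; the entry count and branch factor below are assumptions, not figures from any DBMS.

    // Point-lookup cost: ~log_B(n) node fetches for a B-Tree vs ~1 fetch for a hash index.
    public class PointLookupCost {
        public static void main(String[] args) {
            long n = 100_000_000L;  // indexed entries (assumed)
            int B = 200;            // approximate B-Tree branch factor (assumed)

            double fetches    = Math.ceil(Math.log(n) / Math.log(B));  // ~ log_B(n)
            double cmpPerNode = Math.ceil(Math.log(B) / Math.log(2));  // ~ log_2(B)
            System.out.printf("B-Tree: ~%.0f fetches, ~%.0f comparisons per node%n",
                              fetches, cmpPerNode);
            System.out.println("Hash index: ~1 fetch, one hash computation, a few comparisons");
        }
    }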
In terms of picking, use a hash index if
you only expect point queries
you don't expect the data to fall into any poorly performing cases for the system hash function (oddball case but thought I should mention it)
The B-Tree family is better if you have any kind of range query and/or want sorted results on a pre-determinable set of columns.
I'm working on a project where efficiency is crucial. A hash table would be very helpful since I need to easily look up the memory address of a node based on a key. The only problem I foresee is that this hash table will need to handle up to 1 million entries. As I understand it, hash table buckets are usually linked lists so that they can handle multiple entries hashing to the same bucket. It seems to me that with a million entries these lists would be way too slow. What is the common way of implementing something like this? Maybe swapping a standard linked list out for a skip list?
If you want a hash table with a million entries, normally you'd have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of the buckets will have zero or one items. You can, in principle, be very unlucky and get all items in one bucket - but you'd have to be even more unlucky than those people who seem to get struck by lightning every other day.
For hashtables that grow, the normal trick is to grow by a constant percentage - the usual textbook case being growth by doubling the hash-table size. You do this whenever the number of items in the hashtable reaches a certain proportion of the hashtable size, irrespective of how many buckets are actually being used. This gives amortized expected performance of O(1) for inserts, deletes and searches.
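Here is a minimal sketch of that growth-by-doubling scheme, using chained buckets and a 0.75 load-factor threshold; the class, names, and constants are illustrative assumptions, not any particular library's implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Chained hash table that doubles its bucket array when the load factor is exceeded.
    public class GrowingHashTable {
        private static final double MAX_LOAD = 0.75;
        private List<long[]>[] buckets;   // each bucket holds (key, value) pairs
        private int size = 0;

        @SuppressWarnings("unchecked")
        GrowingHashTable() { buckets = new List[16]; }

        void put(long key, long value) {
            if ((size + 1) > MAX_LOAD * buckets.length) resize();
            int i = Math.floorMod(Long.hashCode(key), buckets.length);
            if (buckets[i] == null) buckets[i] = new ArrayList<>();
            buckets[i].add(new long[]{key, value});
            size++;
        }

        @SuppressWarnings("unchecked")
        private void resize() {
            List<long[]>[] old = buckets;
            buckets = new List[old.length * 2];   // grow by a constant factor (doubling)
            size = 0;
            for (List<long[]> b : old)
                if (b != null)
                    for (long[] kv : b) put(kv[0], kv[1]);  // re-hash every entry once
        }
    }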
The linked list in each bucket of a hash-table is just a way of handling collisions - improbable in a per-operation sense, but over the life of a significant hash table, they do happen - especially as the hash-table gets more than half full.
Linked lists aren't the only way to handle collisions - there's a huge amount of lore about this topic. Walter Bright (developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his Dscript gained a significant performance boost relative to Javascript from this design choice.
He used simple (unbalanced) binary trees when I asked, so the worst-case performance was the same as for linked lists, but the key point I guess is that the binary tree handling code is simple, and the hash table itself makes the odds of building large unbalanced trees very small.
In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option may be to use splay trees for collision handling. But overall, this is a minor issue for a few library designers and a few true obsessives to worry about.
You lose all the advantages of a hash table if the per-bucket lists ever have more than a few entries. The usual way to make a hash table scale to millions of entries is to make the primary hash array resizable, so even with millions of entries, the bucket lists stay short.
You can use a tree instead of a list in the individual "buckets" (AVL or similar).
EDIT: well, a skip list would do too (and seems to be faster); O(log n) is what you are aiming for.
The total number of entries does not matter, only the average number of entries per bucket (N / number of buckets). Use a hash with a larger range of buckets (for example, 20 bits' worth, or even more) to keep that number low.
Of course, this will take up more memory, but that's it; it's the usual memory vs. speed tradeoff.
Not sure if this will help you or not, but maybe: http://memcached.org/
If your keys hash uniformly over the buckets (that's a very big IF), then the expected number of insertions into the hash table needed to exhaust all of its buckets is M*ln(M) (natural log, base e), where M is the number of buckets.
I was surprised I couldn't find this easily online!
I have posted the derivation on my blog, and verified it with code, using rand(). It does seem to be a pretty good estimate.
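For what it's worth, a quick simulation of that claim agrees well with M*ln(M); the bucket count and trial count below are toy assumptions (this is not the blog's code).

    import java.util.Random;

    // Coupon-collector check: with uniformly distributed bucket choices, how many
    // insertions does it take, on average, before every one of the M buckets has
    // been hit at least once?
    public class CouponCollector {
        public static void main(String[] args) {
            int M = 10_000;          // number of buckets (assumed)
            int trials = 50;         // number of simulation runs (assumed)
            Random rnd = new Random();
            long total = 0;
            for (int t = 0; t < trials; t++) {
                boolean[] hit = new boolean[M];
                int remaining = M;
                long inserts = 0;
                while (remaining > 0) {
                    int b = rnd.nextInt(M);   // bucket chosen by the hash of a random key
                    inserts++;
                    if (!hit[b]) { hit[b] = true; remaining--; }
                }
                total += inserts;
            }
            System.out.printf("simulated average: %.0f, M*ln(M): %.0f%n",
                              (double) total / trials, M * Math.log(M));
        }
    }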
What is the best data structure to store millions/billions of records (assume a record contains a name and an integer) in memory (RAM)?
Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority)? Is it a Patricia trie? Anything better than that?
The search key is an integer (say a 32-bit random integer), and all records are in RAM (assuming enough RAM is available).
In C, platform Linux.
Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search/delete records efficiently. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes, or is it uniformly distributed? A Patricia trie is only useful if there are common prefixes.
Do you look up keys in order (gang lookup) or randomly? If everything is uniform and random with no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer and gang lookup is used, you might look into radix trees.
My educated guess is a B-tree (but I could be wrong ...):
B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
Instead of a hash you can at least use a radix tree to get started.
For any specific problem, you can do much better than a B-tree, a hash table, or a Patricia trie. Describe the problem a bit better, and we can suggest what might work.
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If you use a hash table, you want to pre-allocate it for the expected final size so it doesn't need to rehash.
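A small sketch of the pre-allocation idea (shown in Java for consistency with the other examples, although the question is about C; the entry count is an illustrative assumption): size the table for the expected number of entries up front so inserts never trigger a rehash.

    import java.util.HashMap;
    import java.util.Map;

    public class PreallocatedLookup {
        public static void main(String[] args) {
            int expectedEntries = 1_000_000;
            // HashMap grows once its size exceeds capacity * loadFactor, so request an
            // initial capacity big enough that the expected entries fit under that limit.
            Map<Integer, String> byKey =
                    new HashMap<>((int) Math.ceil(expectedEntries / 0.75) + 1);

            byKey.put(0x5EED_BEEF, "alice");            // 32-bit key -> user record
            System.out.println(byKey.get(0x5EED_BEEF)); // alice
        }
    }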
We can use a bitwise trie, where each node branches on a 0/1 bit of the key, to store the integer values. With this we can ensure that the depth of the tree is at most 32/64, so fetch time is effectively constant, at the cost of some extra space for the internal nodes.
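And a hedged sketch of that bitwise trie (again in Java; the node layout and names are made up): each level consumes one bit of the 32-bit key, so a lookup touches at most 32 nodes regardless of how many records are stored.

    // Binary (bitwise) trie keyed on a 32-bit integer.
    public class BitwiseTrie {
        private static final class Node {
            Node[] child = new Node[2];   // child[0] for a 0 bit, child[1] for a 1 bit
            Object record;                // user record stored at the leaf
        }

        private final Node root = new Node();

        public void put(int key, Object record) {
            Node cur = root;
            for (int i = 31; i >= 0; i--) {            // walk from the most significant bit
                int bit = (key >>> i) & 1;
                if (cur.child[bit] == null) cur.child[bit] = new Node();
                cur = cur.child[bit];
            }
            cur.record = record;
        }

        public Object get(int key) {
            Node cur = root;
            for (int i = 31; i >= 0 && cur != null; i--) {
                cur = cur.child[(key >>> i) & 1];
            }
            return cur == null ? null : cur.record;
        }
    }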