Associative arrays are usually implemented with hash tables. But recently I learned that they can also be implemented using trees. Can someone explain how to implement one using a tree?
Should we use a simple binary tree or a BST?
How do we represent the keys in the tree? Do we calculate a hash function on the key and insert the (key, value) pair based on the integer hash value?
If we assume that we calculate a hash value and insert into the tree, why do people say trees are ordered? What order does it preserve, and how? What does this ordering buy us?
Lastly, one general question about hash tables. People say the search time is O(1) in a hash table. But when we say O(1), do we take into account the cost of calculating the hash value before we look up using it? If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n) + O(1) time?
Hash table solutions take the hash of the objects stored in them (it's important to note that the hash is a primitive, often an integer or long), use that hash as an array index, and store a reference to the object at that index in the array. To solve the collision problem, the slots of this array often contain linked-list nodes that hold the actual references.
Checking whether an object is in the table is as simple as hashing it, looking at the index referred to by the hash, and comparing it for equality with each object that's there. This runs in amortized constant time, because the hash table can grow larger if collisions begin to accumulate.
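To make that concrete, here's a rough sketch in C++ of the idea - a bucket array plus a chain per bucket. The names (ChainedHashMap and so on) are made up for illustration, and a real table would also resize and rehash as it fills:

#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Sketch of a chained hash table: an array of buckets, each bucket a list of
// (key, value) pairs whose keys collided on the same index.
template <typename K, typename V>
class ChainedHashMap {
    std::vector<std::list<std::pair<K, V>>> buckets_;

    std::size_t index_of(const K& key) const {
        return std::hash<K>{}(key) % buckets_.size();   // hash -> array index
    }

public:
    ChainedHashMap() : buckets_(16) {}

    void insert(const K& key, const V& value) {
        buckets_[index_of(key)].emplace_back(key, value);
        // A real implementation would grow (rehash) once the buckets fill up.
    }

    const V* find(const K& key) const {
        for (const auto& kv : buckets_[index_of(key)])  // walk the collision chain
            if (kv.first == key)                        // equality check, not just the hash
                return &kv.second;
        return nullptr;
    }
};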
Now for your questions.
Question 1
Should we use a simple binary tree or a BST?
BST stands for Binary Search Tree. The difference between it and a "simple" binary tree is that a BST is strictly ordered. Optimal implementations will also attempt to keep the tree balanced. Searching a balanced, ordered tree is much easier than searching an unordered one, because certain assumptions can be made about the location of the target element that cannot be made in an unordered tree. You should use a BST if practical.
Question 2
How do we represent the keys in the tree? Do we calculate a hash function on the key and insert the (key, value) pair based on the integer hash value?
That would be what a hash table does. Even hash tables must store the keys by value for equality checking, because of the possibility of collisions. A BST would not store hashes at all, because in anything but trivial cases, determining a sort order from the hash values would be very difficult. You would use (key, value) pairs without any hashing.
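For illustration, here is a minimal sketch of a BST node used as an associative array (unbalanced, so not something you'd ship; the names are mine). Note that the key itself is stored and compared - no hash appears anywhere:

#include <memory>
#include <string>

// One node of a (non-balancing) binary search tree used as a map: ordering
// comes from comparing the keys directly.
struct Node {
    std::string key;
    int value;
    std::unique_ptr<Node> left, right;
};

int* find(Node* n, const std::string& key) {
    while (n) {
        if (key < n->key)      n = n->left.get();   // smaller keys live to the left
        else if (n->key < key) n = n->right.get();  // larger keys live to the right
        else                   return &n->value;    // exact match
    }
    return nullptr;  // key not present
}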
Question 3
If we assume that we calculate a hash value and insert into the tree, why do people say trees are ordered? What order does it preserve, and how? What does this ordering buy us?
As you noticed, it doesn't work that way, so we store the value of the key instead of the hash. Ordering the tree (as opposed to leaving it unordered) gives us a huge advantage when searching: O(log N) instead of O(N). In an unordered structure, a specific element must be searched for exhaustively, because it could reside anywhere in the structure. In an ordered structure, it can only exist above keys whose values are less than its own, and below keys whose values are greater. Consider a textbook: would you have an easier or harder time jumping to a specific page if the pages were in a random order?
Question 4
If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n) + O(1) time?
I've asked myself the same question before, and it actually depends on the hash function. The worst-case time of an "amortized constant" lookup is:
O(H) + O(C)
O(H) is the worst-case complexity of the hash function. A smart hash function might look at only the first few dozen characters of a string, or the last few, or some in the middle, or something like that. The hash function doesn't need to be secure; it just has to be deterministic (i.e. return the same value for identical objects). Regardless of how good your function is, you will get collisions anyway, so if you can avoid a huge amount of extra hashing work at the cost of a slightly more collision-prone function, it's often worth it.
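To make O(H) concrete, here's an illustrative hash that reads at most a bounded prefix of the string, so its cost doesn't grow with very long keys - at the price of more collisions among strings that share that prefix. The constants are FNV-1a's; the 64-character cutoff is arbitrary:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Hash at most the first 64 characters, so O(H) is bounded by a constant no
// matter how long the key is. Strings sharing their first 64 characters will
// collide -- exactly the trade-off described above.
std::uint64_t bounded_hash(const std::string& s) {
    std::uint64_t h = 1469598103934665603ull;            // FNV-1a offset basis
    const std::size_t limit = std::min<std::size_t>(s.size(), 64);
    for (std::size_t i = 0; i < limit; ++i) {
        h ^= static_cast<unsigned char>(s[i]);
        h *= 1099511628211ull;                            // FNV-1a prime
    }
    return h;
}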
O(C) is the worst-case complexity of comparing the keys. An unsuccessful lookup may know immediately that there are no matches, because no entry exists in the table for its hash value. A successful lookup, however, must always compare the provided key with the ones in the table; otherwise there is the possibility that the key is not actually a match but merely a collision. The worst possible case is when there are multiple collisions for one hash value: the provided key must be compared with all of the stored colliding keys, one after another, until a match is found or all comparisons fail. This gotcha is why it's only amortized constant time instead of plain constant time: as the table grows, the chance of collisions decreases, so this happens less frequently, and the average time required to search a collection of elements always tends toward a constant.
When people say a hash lookup is O(1) (or, if you like, O(1 + 0·n)) and a tree lookup is O(log n), the n they mean is the size of the collection.
You are right that it takes O(m) additional work (where m is the length of the string currently being examined), but usually m has some upper bound while n tends to get large. So m can be considered a constant for both implementations.
And m is, in absolute terms, more influential in the tree implementation, because you compare against the key in every node you visit, whereas with a hash you only compute the hash once and compare the whole key with the values in the single bucket determined by that hash (with a good hash function and a big enough table, there should usually be only one value there).
Related
I've seen many examples and articles explaining hash maps based on arrays.
For example, all of the data is stored in an array, and you can get the index of a value by calling a hash function on its key.
So, when there is no collision, any access to the hash map is O(1), since the access is really just an array access with a known index.
My question is: is there any implementation, in any language or library, where the hash map is not built on an array?
Must a hash map be built on an array?
While there are other data structures that use both hash functions and data structures other than arrays to provide key/value mappings (you can read about some at https://en.wikipedia.org/wiki/Hash_trie), I'd argue that they are not "hash maps" as the combined phrase is generally intended and understood. When people talk about hash maps, they generally expect a hash table underneath, and that will have one array of buckets (or occasionally a couple to allow a gradual resizing for more consistent performance).
A general expectation for hash maps is that they provide amortised O(1) lookup, and that is only possible if - once the key is hashed - you can do O(1) lookup using the hash value. An array is the obvious data structure with that property. There are some other options, like having a fixed-depth tree of arrays, the upper levels holding pointers to the lower levels, with the final level being either values or pointers to them. At least one level must be allowed to grow arbitrarily large for the overall container to support arbitrary size. With a constant maximum depth, the number of lookup steps isn't correlated with the number of keys or elements stored, so it's still O(1) lookup. That provides partial mitigation of the cost of copying/moving elements between large arrays more often, but at the cost of constantly having a greater (but constant) number of levels to traverse during lookups.
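As a rough two-level illustration of that idea (a conceptual sketch, not any particular library's layout): the bucket index picks a block and an offset within it, so reaching a slot is always exactly two steps, no matter how many blocks have been added:

#include <cstddef>
#include <vector>

// Two-level directory of equally sized blocks: the container grows by adding
// whole blocks, and indexing is always block-lookup + offset-lookup.
struct TwoLevelDirectory {
    static constexpr std::size_t kBlockSize = 1024;
    std::vector<std::vector<int>> blocks;   // each block holds kBlockSize slots

    int& slot(std::size_t bucket_index) {
        std::size_t block  = bucket_index / kBlockSize;  // which block
        std::size_t offset = bucket_index % kBlockSize;  // where inside the block
        while (blocks.size() <= block)
            blocks.emplace_back(kBlockSize);             // grow by whole blocks
        return blocks[block][offset];
    }
};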
A C++ std::deque utilises this kind of structure to provide O(1) lookup. See STL deque accessing by index is O(1)? for explanations, diagrams. That said, I've never seen anyone choose to use a deque when implementing a hash map.
So I am still in the theory portion of my project (a phonebook) and am wondering how to store multiple values for a single key in a TRIE structure. When I looked it up, most people said to use a TRIE when creating a phone book; but if I wanted to store number, email, address, etc. all under the key - which would be the name - how would that work? Could I still use a TRIE? Or am I thinking about this the wrong way? Thanks.
I think you would usually create separate indexes for each dimension (e.g. phone number, name, ...). These would normally be B-trees or hash maps.
Generally, these individual indexes will allow faster lookups than a multi-dimensional index (even though multi-dimensional TRIEs have very fast lookup speed).
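To picture the "separate index per dimension" suggestion, here is a hypothetical sketch (the Contact/Phonebook names and fields are placeholders): the records are stored once, and each searchable field gets its own ordered index pointing at them:

#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One owning container of records, plus one ordered index per field you want
// to look up by. Each index maps a key to a pointer to the same shared record.
struct Contact {
    std::string name, phone, email, address;
};

struct Phonebook {
    std::vector<std::shared_ptr<Contact>> records;
    std::map<std::string, std::shared_ptr<Contact>> by_name;   // index on name
    std::map<std::string, std::shared_ptr<Contact>> by_phone;  // index on phone

    void add(Contact c) {
        auto rec = std::make_shared<Contact>(std::move(c));
        records.push_back(rec);
        by_name[rec->name]   = rec;
        by_phone[rec->phone] = rec;
    }
};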
If you really want to use a multi-dimensional trie, have a look at the PH-Tree; it is a true multi-dimensional TRIE (disclaimer: I am self-advertising here).
There are Java and C++ implementations, but they are all aimed at 64-bit numbers, e.g. coordinates in space. WARNING: the available implementations allow only one entry per key, so you will have to store lists for each key in order to allow multiple entries per key.
If you want to use the PH-Tree for strings etc. (I would treat the phone number as a string), you can either write your own PH-Tree (not easy to do) or encode the strings in a magic number. For example, convert the leading six characters into numbers by using their ASCII codes, create a small hash of the whole string, and store the hash in the remaining two bytes. There are many ways to improve this encoding. The resulting number can then be used as one dimension of the key.
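A sketch of that encoding (just one of the many possible variants, not something shipped with the PH-Tree): pack the first six characters into the high bytes and a 16-bit hash of the whole string into the low bytes. Because the leading characters land in the most significant bits, strings sharing a prefix end up numerically close together:

#include <cstdint>
#include <string>

// Pack a string into a 64-bit key: six leading characters (zero-padded if the
// string is shorter) in the upper bytes, a small hash of the whole string in
// the lower two bytes to reduce collisions among strings with equal prefixes.
std::uint64_t string_to_key(const std::string& s) {
    std::uint64_t key = 0;
    for (int i = 0; i < 6; ++i) {
        unsigned char c = i < static_cast<int>(s.size()) ? s[i] : 0;
        key = (key << 8) | c;
    }
    std::uint16_t h = 0;
    for (unsigned char c : s)
        h = static_cast<std::uint16_t>(h * 31 + c);       // tiny 16-bit hash
    return (key << 16) | h;
}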
Conceptually, the PH-Tree interleaves the bits of all dimensions into a single long bitstring. These long bit-strings are then stored in a TRIE, however there are some quirks (i.e. each node has up to 2^dimension children). For a full explanation, please have a look at the papers on the website.
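For intuition about bit interleaving in general - this is plain Morton/Z-order encoding in two dimensions, not the PH-Tree's actual node layout - here's what weaving two 32-bit values into one 64-bit bitstring looks like:

#include <cstdint>

// Interleave the bits of x and y: bit i of x goes to bit 2*i of the result,
// bit i of y goes to bit 2*i+1. The same idea generalizes to more dimensions.
std::uint64_t interleave2(std::uint32_t x, std::uint32_t y) {
    std::uint64_t result = 0;
    for (int i = 0; i < 32; ++i) {
        result |= (static_cast<std::uint64_t>(x >> i) & 1u) << (2 * i);
        result |= (static_cast<std::uint64_t>(y >> i) & 1u) << (2 * i + 1);
    }
    return result;
}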
In summary: I would not use a multi-dimensional index unless you really need it. If you need to use a multi-dimensional index, the PH-Tree may be a good choice, it is very fast to update and scales comparatively well with increasing dimensionality.
Let's say you want to read from a file and, based on its content, create a structure (or an array) containing multiple objects, for example:
struct {
    unsigned id;
    char name[16];
    float price;
} *items;
Then you want to refer to an object (some item) using the obtained name, because that's how a user would know what to look for.
However, implementing searches with loops will be very slow, especially if you have to loop every time you want to access an item and you need to access items all the time. Converting the string to an integer and then using a lookup table (sacrificing memory for performance) is a solution, but what if the name is longer than 8 bytes?
What is the fastest approach to accessing an allocated structure, filled from an arbitrary file, using name identifiers (a string field of the structure)?
Using a binary search tree for this is ideal. Of course, more advanced structures like B-trees will probably increase performance further, but knowing about binary search trees is really all you need: they are quite efficient in many cases like the one you describe above, and they are still simple and easy to implement.
A simple method is to use bsearch() from the standard library, but then insertion into the data collection becomes difficult or inefficient.
These are still loop-based, but they are far more efficient than a linear lookup.
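For example (using the struct from the question; I've given it a name, Item, and the helper names are mine): sort the array by name once after loading, then let bsearch() do the O(log n) lookups:

#include <cstddef>
#include <cstdlib>
#include <cstring>

struct Item {
    unsigned id;
    char name[16];
    float price;
};

static int compare_by_name(const void* a, const void* b) {
    return std::strcmp(static_cast<const Item*>(a)->name,
                       static_cast<const Item*>(b)->name);
}

// Call std::qsort(items, item_count, sizeof(Item), compare_by_name) once after
// loading the file; after that, each lookup costs O(log n) comparisons.
Item* find_item(Item* items, std::size_t item_count, const char* name) {
    Item key{};                                          // zero-initialized
    std::strncpy(key.name, name, sizeof key.name - 1);   // stays null-terminated
    return static_cast<Item*>(
        std::bsearch(&key, items, item_count, sizeof(Item), compare_by_name));
}

As noted above, the downside is insertion: adding a new item later means re-sorting, or shifting entries to keep the array sorted.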
One approach would be to use a hash table using a hashing function that takes a string.
As long as the table is suitably sized and the hashing function distributes evenly across the table it will remain efficient to add and find entries. But, if collisions occur, then some looping will be required to search through the entries with the same hash value.
How does lua handle a table's growth?
Is it equivalent to ArrayList in Java? That is, one that needs contiguous memory space, and when it grows bigger than the already allocated space, the internal array is copied to another memory space.
Is there a clever way to deal with that?
My question is, how is a table stored in the memory? I'm not asking how to implement arrays in Lua.
(Assuming you're referring to recent versions of Lua; describing the behavior of 5.3 which should be (nearly?) the same for 5.0-5.2.)
Under the hood, a table contains an array and a hash part. Both (independently) grow and shrink in power-of-two steps, and both may be absent if they aren't needed.
Most key-value pairs will be stored in the hash part. However, all positive integer keys (starting from 1) are candidates for storing in the array part. The array part stores only the values and doesn't store the keys (because they are equivalent to an element's position in the array). Up to half of the allocated space is allowed to be empty (i.e. contain nils – either as gaps or as trailing free slots). (Array candidates that would leave too many empty slots will instead be put into the hash part. If the array part is full but there's leftover space in the hash part, any entries will just go to the hash part.)
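To make the routing idea concrete - this is only a conceptual sketch in C++, not Lua's actual code (that lives in ltable.c and differs in many details such as power-of-two sizing, nil handling, and rehashing on demand) - you can picture it roughly like this:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Small positive integer keys go into a plain array that stores no keys at
// all; everything else lands in a hash part that stores full key/value pairs.
struct HybridTable {
    std::vector<std::string> array_part;                      // value for key i sits at index i-1
    std::unordered_map<std::string, std::string> hash_part;   // all other keys (kept as strings here)

    void set(long long key, const std::string& value) {
        if (key >= 1 && static_cast<std::size_t>(key) <= array_part.size() + 1) {
            if (static_cast<std::size_t>(key) == array_part.size() + 1)
                array_part.push_back(value);        // extend the array part by one
            else
                array_part[key - 1] = value;        // overwrite in place
        } else {
            hash_part[std::to_string(key)] = value; // sparse integer key -> hash part
        }
    }

    void set(const std::string& key, const std::string& value) {
        hash_part[key] = value;                     // non-integer key -> hash part
    }
};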
For both array and hash part, insertions can trigger a resize, either up to the next larger power of two or down to any smaller power of two if sufficiently many entries have been removed previously. (Actually triggering a down-resize is non-trivial: rehash is the only place where a table is resized (and both parts are resized at the same time), and it is only called from luaH_newkey if there wasn't enough space in either of the two parts1.)
For more information, you can look at chapter 4 of The Implementation of Lua 5.0, or inspect the source: Basically everything of relevance happens in ltable.c, interesting starting points for reading are rehash (in ltable.c) (the resizing function), and the main interpreter loop luaV_execute (in lvm.c) or more specifically luaV_settable (also there) (what happens when storing a key-value pair in a table).
1As an example, in order to shrink a table that contained a large array part and no hash, you'd have to clear all array entries and then add an entry to the hash part (i.e. using a non-integer key, the value may be anything including nil), to end up with a table that contains no array part and a 1-element hash part.
If both parts contained entries, you'd have to first clear the hash part, then add enough entries to the array part to fill both array and hash combined (to trigger a resize which will leave you with a table with a large array part and no hash), and subsequently clear the array as above.2 (First clearing the array and then the hash won't work because after clearing both parts you'll have no array and a huge hash part, and you cannot trigger a resize because any entries will just go to the hash.)
2Actually, it's much easier to just throw away the table and make a new one. To ensure that a table will be shrunk, you'd need to know the actual allocated capacity (which is not the current number of entries, and which Lua won't tell you, at least not directly), then get all the steps and all the sizes just right – mix up the order of the steps or fail to trigger the resize and you'll end up with a huge table that may even perform slower if you're using it as an array… (Array candidates stored in the hash part also store their keys, leaving you half the amount of useful data in e.g. a cache line.)
Since Lua 5.0, tables are a hybrid of hash table and array. From The Implementation of Lua 5.0:
New algorithm for optimizing tables used as arrays:
Unlike other scripting languages, Lua does not offer an array type. Instead, Lua programmers use regular tables with integer indices to implement arrays. Lua 5.0 uses a new algorithm that detects whether tables are being used as arrays and automatically stores the values associated to numeric indices in an actual array, instead of adding them to the hash table. This algorithm is discussed in Section 4.
Prior versions had only the hash table.
I have a scenario where I have to set a few records' field values to a constant and then later access those records one by one, sequentially.
The records can be random records.
I don't want to use a linked list, as it will be costly, and I don't want to traverse the whole buffer.
Please give me some ideas on how to do that.
When you say "set a few records' field values to a constant", is this like a key to the record? And then "later access them one by one" - is this to recall them with some key? "One by one, sequentially" and "don't want to traverse the whole buffer" seem to conflict, as sequential access sounds a lot like traversal.
But I digress. If you do in fact have a key (and it's a number), you could use some sort of hash table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range and then add the record to the list there. This can perform well, assuming you have a good distribution of keys (i.e. your records spread across the array well).
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.