Is it OK for a non-leaf page in a B+ tree to hold a key that does not appear in any of the leaf pages?
Yes, the keys in the internal blocks may differ from the actual keys stored in the bottom (data) layer of the B+ tree.
These internal keys only serve to delimit intervals, not to represent actual keys. For instance, to find a key in the tree, the traversal always proceeds to the bottom layer of the tree, and only there is the key located (or determined not to exist). The internal keys only determine the direction of the search.
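To make this concrete, here is a minimal C sketch of such a lookup; the node layout (fixed fan-out, field names) is illustrative, not taken from any particular implementation:

    #include <stdbool.h>

    typedef struct node {
        bool is_leaf;
        int n;                   /* number of keys in this node */
        int keys[64];
        struct node *child[65];  /* used only in internal nodes */
    } node;

    bool bptree_contains(const node *t, int key)
    {
        /* Descend: internal keys are mere separators and are never
           checked for equality. */
        while (!t->is_leaf) {
            int i = 0;
            while (i < t->n && key >= t->keys[i])
                i++;
            t = t->child[i];
        }
        /* Only at the leaf do we decide whether the key exists. */
        for (int i = 0; i < t->n; i++)
            if (t->keys[i] == key)
                return true;
        return false;
    }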
As you suggest, this can reduce the update work needed when a key is deleted: unless the deletion triggers a merge, there is no need to go up the tree and update keys.
The fact that keys in internal blocks need not correspond to actual data keys is also used to save space. As Wikipedia mentions:
For internal blocks, space saving can be achieved by either compressing keys or pointers. For string keys, space can be saved by using the following technique: Normally the i-th entry of an internal block contains the first key of block i + 1. Instead of storing the full key, we could store the shortest prefix of the first key of block i + 1 that is strictly greater (in lexicographic order) than the last key of block i.
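A hedged sketch of that prefix technique in C (the function name is mine; it assumes last < next lexicographically and that the caller frees the result):

    #include <stdlib.h>
    #include <string.h>

    /* Return the shortest prefix of `next` (the first key of block i + 1)
       that is strictly greater than `last` (the last key of block i). */
    char *shortest_separator(const char *last, const char *next)
    {
        size_t i = 0;
        while (last[i] != '\0' && last[i] == next[i])
            i++;                 /* skip the common prefix */
        /* One byte past the common prefix suffices: either next[i] > last[i]
           or `last` has ended, so this prefix already compares greater. */
        char *sep = malloc(i + 2);
        memcpy(sep, next, i + 1);
        sep[i + 1] = '\0';
        return sep;
    }

For example, with last = "Smith" and next = "Snyder", the stored separator would be just "Sn".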
I am working on a project in C where I have to store, sort, and update information obtained in real time. The maximum amount of information I can have is known in advance. Each piece of information is a tuple <key,value1,value2>, but it is sorted according to key and value1. The key indicates the start of the node and value1 indicates its size.
The basic operations I need to perform here are insertion, search, and deletion, but most importantly a merge whenever I find sequential (adjacent) information.
As an example, in an empty structure, inserting <100,1> creates one node.
Next, inserting <102,2> creates another node, so two nodes now exist in the tree.
Next, inserting <101,1> should, in the end, leave only one node in the tree: <100,4>.
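For what it's worth, here is a minimal C sketch of that merge rule, using a plain sorted array so the idea stays visible (in your design a B+ tree would take the array's place); an interval <key,len> covers [key, key+len):

    #include <string.h>

    typedef struct { int key; int len; } interval;

    /* Insert <key,len> into the sorted array iv of *n entries, merging it
       with any neighbours that overlap or touch it. Assumes spare capacity. */
    void insert_merged(interval *iv, int *n, int key, int len)
    {
        int i = 0;
        while (i < *n && iv[i].key + iv[i].len < key)
            i++;                                   /* intervals ending before us */
        int j = i;
        while (j < *n && iv[j].key <= key + len) { /* absorb touching neighbours */
            if (iv[j].key < key) { len += key - iv[j].key; key = iv[j].key; }
            if (iv[j].key + iv[j].len > key + len)
                len = iv[j].key + iv[j].len - key;
            j++;
        }
        memmove(&iv[i + 1], &iv[j], (size_t)(*n - j) * sizeof *iv);
        *n -= (j - i) - 1;
        iv[i].key = key;
        iv[i].len = len;
    }

Running the three inserts above in order leaves exactly one entry, <100,4>.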
I also want to separately sort these nodes according to their value2. Note that value2 can be updated in real time.
I was thinking about a B+ tree because of its logarithmic performance in all cases, since all leaf nodes are at the same level. And by using a separate doubly linked list, I can maintain separate links that keep the nodes sorted according to value2.
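A sketch of the node layout that design implies (field names are mine): each record is keyed by <key,value1> in the tree and additionally threaded onto a doubly linked list kept sorted by value2:

    typedef struct record {
        int key, value1, value2;
        struct record *v2_prev, *v2_next;  /* list ordered by value2 */
    } record;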
But the overhead of keeping things sorted according to <key,value1> is that I always have to do a search operation first: for merge, for add, and for delete.
Any thoughts/suggestions about this?
How does Lua handle a table's growth?
Is it equivalent to the ArrayList in Java? I.e., one that needs contiguous memory space, and as it grows beyond the already-allocated space, the internal array is copied into another memory area.
Is there a clever way to deal with that?
My question is: how is a table stored in memory? I'm not asking how to implement arrays in Lua.
(Assuming you're referring to recent versions of Lua; describing the behavior of 5.3 which should be (nearly?) the same for 5.0-5.2.)
Under the hood, a table contains an array and a hash part. Both (independently) grow and shrink in power-of-two steps, and both may be absent if they aren't needed.
Most key-value pairs will be stored in the hash part. However, all positive integer keys (starting from 1) are candidates for the array part. The array part stores only the values, not the keys (the key is implied by an element's position in the array). Up to half of the allocated space is allowed to be empty, i.e. contain nils, either as gaps or as trailing free slots. (Array candidates that would leave too many empty slots are instead put into the hash part. If the array part is full but there's leftover space in the hash part, new entries simply go to the hash part.)
For both the array and hash part, insertions can trigger a resize, either up to the next larger power of two or down to any smaller power of two if sufficiently many entries have been removed previously. (Actually triggering a down-resize is non-trivial: rehash is the only place where a table is resized (and both parts are resized at the same time), and it is only called from luaH_newkey if there wasn't enough space in either of the two parts¹.)
For more information, you can look at chapter 4 of The Implementation of Lua 5.0, or inspect the source: Basically everything of relevance happens in ltable.c, interesting starting points for reading are rehash (in ltable.c) (the resizing function), and the main interpreter loop luaV_execute (in lvm.c) or more specifically luaV_settable (also there) (what happens when storing a key-value pair in a table).
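As a small illustration from the C API side (assuming Lua 5.3): lua_createtable lets you pre-size both parts, so the power-of-two growth described above never needs to run while you fill the table:

    #include <lua.h>
    #include <lauxlib.h>

    int main(void)
    {
        lua_State *L = luaL_newstate();

        /* Reserve 1000 array-part slots and 0 hash-part slots up front. */
        lua_createtable(L, 1000, 0);
        for (int i = 1; i <= 1000; i++) {
            lua_pushinteger(L, i * i);
            lua_rawseti(L, -2, i);   /* t[i] goes straight to the array part */
        }

        /* A non-integer key lands in the hash part, which grows on demand. */
        lua_pushstring(L, "hello");
        lua_setfield(L, -2, "greeting");

        lua_close(L);
        return 0;
    }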
¹ As an example, in order to shrink a table that contained a large array part and no hash, you'd have to clear all array entries and then add an entry to the hash part (i.e. using a non-integer key; the value may be anything, including nil), to end up with a table that contains no array part and a 1-element hash part.
If both parts contained entries, you'd have to first clear the hash part, then add enough entries to the array part to fill both array and hash combined (to trigger a resize, which will leave you with a table with a large array part and no hash), and subsequently clear the array as above.² (First clearing the array and then the hash won't work, because after clearing both parts you'll have no array and a huge hash part, and you cannot trigger a resize because any new entries will just go to the hash.)
² Actually, it's much easier to just throw away the table and make a new one. To ensure that a table will be shrunk, you'd need to know the actual allocated capacity (which is not the current number of entries, and which Lua won't tell you, at least not directly), then get all the steps and all the sizes just right; mix up the order of the steps or fail to trigger the resize and you'll end up with a huge table that may even perform slower if you're using it as an array… (Array candidates stored in the hash part also store their keys, leaving half the amount of useful data in e.g. a cache line.)
Since Lua 5.0, tables are a hybrid of hash table and array. From The Implementation of Lua 5.0:
New algorithm for optimizing tables used as arrays: Unlike other scripting languages, Lua does not offer an array type. Instead, Lua programmers use regular tables with integer indices to implement arrays. Lua 5.0 uses a new algorithm that detects whether tables are being used as arrays and automatically stores the values associated to numeric indices in an actual array, instead of adding them to the hash table. This algorithm is discussed in Section 4.
Prior versions had only the hash table.
I'm trying to implement a B+ tree (in C) where each key is some piece of data (int/float/string) and the corresponding value is a list whose size is not fixed.
I want to store this tree in a file and access it later on, when required. You may consider the implementation as follows:
Each search key corresponds to a page in the file, and
Each page contains the set of values corresponding to that key.
The problem is: I cannot just assign one page per key, as a key's list may occupy very little of the page and waste the rest. So I need a space-efficient way of implementing a persistent B+ tree in the file system, instead of in main memory.
Check out this disk-based B-tree implementation; it might help.
Also see the paper titled Fully Persistent B+-Trees.
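For the page-waste problem specifically, one common approach (my suggestion, not taken from the links above) is to pack the variable-length value lists into pages and chain overflow pages when a list outgrows a single page. A minimal sketch of such a page layout, assuming 4 KiB pages:

    #include <stdint.h>

    #define PAGE_SIZE 4096

    typedef struct {
        uint64_t next_overflow;  /* file offset of the next page in the
                                    chain, 0 if none */
        uint16_t n_bytes;        /* payload bytes used in this page */
        uint8_t  payload[PAGE_SIZE - sizeof(uint64_t) - sizeof(uint16_t)];
    } value_page;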
I have a scenario where I have to set a few records' field values to a constant and then later access them one by one, sequentially.
The records can be random records.
I don't want to use a linked list, as it would be costly, and I don't want to traverse the whole buffer.
Please give me some ideas on how to do that.
When you say "set a few records with field values to a constant", is this like a key to the record? And then "later access them one by one", is this to recall them with some key? "One by one, sequentially" and "don't want to traverse the whole buffer" seem to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
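A minimal sketch of that array-of-linked-lists idea in C, assuming integer keys (the bucket count and all names are illustrative):

    #include <stdlib.h>

    #define N_BUCKETS 256

    typedef struct node {
        int key;
        int value;               /* stand-in for the record's fields */
        struct node *next;
    } node;

    node *buckets[N_BUCKETS];    /* zero-initialized: all chains empty */

    void put(int key, int value)
    {
        node *n = malloc(sizeof *n);
        n->key = key;
        n->value = value;
        /* Mod the key down into the array's range, push onto that chain. */
        size_t i = (size_t)key % N_BUCKETS;
        n->next = buckets[i];
        buckets[i] = n;
    }

    node *get(int key)
    {
        for (node *n = buckets[(size_t)key % N_BUCKETS]; n; n = n->next)
            if (n->key == key)
                return n;
        return NULL;             /* not found */
    }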
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.
Associative arrays are usually implemented with Hashtables. But recently, I came to know that they can also be implemented using Trees. Can someone explain how to implement using a Tree?
Should we use a simple binary tree or BST?
How do we represent the keys in the tree? Do we apply a hash function to the key and insert the (key,value) pair based on the integer hash value?
If we assume that we calculate a hash value and insert into the tree, why do people say trees are ordered? What order does it preserve, and how? What does this ordering buy us?
Lastly, one general question about hash tables. People say the search time is O(1) in a hash table. But when we say O(1), do we take into account the cost of calculating the hash value before we look up using it? If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n) + O(1) time?
Hash table solutions take the hash of the objects stored in them (it's important to note that the hash is a primitive, often an integer or long) and use that hash as an array index, storing a reference to the object at that index in the array. To solve the collision problem, indices in this array often contain linked-list nodes that hold the actual references.
Checking whether an object is in the table is as simple as hashing it, looking at the index referred to by the hash, and comparing for equality with each object stored there. This runs in amortized constant time, because the hash table can grow larger when collisions begin to accumulate.
Now for your questions.
Question 1
Should we use a simple binary tree or a BST?
BST stands for Binary Search Tree. The difference between it and a "simple" binary tree is that a BST is strictly ordered. Optimal implementations will also attempt to keep the tree balanced. Searching through a balanced, ordered tree is much easier than through an unordered one, because certain assumptions can be made about the location of the target element that cannot be made in an unordered tree. You should use a BST if practical.
Question 2
How do we represent the keys in the tree? Do we apply a hash function to the key and insert the (key,value) pair based on the integer hash value?
That would be what a hash table does. Even hash tables must store the keys by value, for equality checking, because of the possibility of collisions. A BST would not store hashes at all, because in all non-trivial circumstances, determining sort order from hash values would be very difficult. You would use (key, value) pairs without any hashing.
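For illustration, a minimal C sketch of a BST used as an associative array, ordered by the key itself (here a string compared with strcmp) rather than by its hash:

    #include <stdlib.h>
    #include <string.h>

    typedef struct bst_node {
        const char *key;         /* caller keeps the key string alive */
        int value;
        struct bst_node *left, *right;
    } bst_node;

    bst_node *bst_insert(bst_node *root, const char *key, int value)
    {
        if (!root) {
            bst_node *n = malloc(sizeof *n);
            n->key = key;
            n->value = value;
            n->left = n->right = NULL;
            return n;
        }
        int cmp = strcmp(key, root->key);
        if (cmp < 0)      root->left  = bst_insert(root->left,  key, value);
        else if (cmp > 0) root->right = bst_insert(root->right, key, value);
        else              root->value = value;   /* existing key: overwrite */
        return root;
    }

    bst_node *bst_find(bst_node *root, const char *key)
    {
        while (root) {
            int cmp = strcmp(key, root->key);
            if (cmp == 0) return root;
            root = cmp < 0 ? root->left : root->right;
        }
        return NULL;
    }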
Question 3
If we assume that we calculate a hash value and insert into the tree, why do people say trees are ordered? What order does it preserve, and how? What does this ordering buy us?
As you noticed, it doesn't work that way, so we store the value of the key instead of its hash. Ordering the tree (as opposed to leaving it unordered) gives us a huge advantage when searching: O(log N) instead of O(N). In an unordered structure, a specific element must be exhaustively searched for, because it could reside anywhere in the structure. In an ordered structure, it can only exist above keys that compare less than it and below keys that compare greater. Consider a textbook: would you have an easier or harder time jumping to a specific page if the pages were in random order?
Question 4
If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n) + O(1) time?
I've asked myself the same question before, and it actually depends on the hash function. An "amortized constant" lookup's worst-case time is:
O(H) + O(C)
O(H) is the worst-case complexity of the hash function. A smart hash function might look at only the first few dozen characters of a string, or the last few, or some in the middle. The hash function doesn't need to be secure; it just has to be deterministic (i.e. return the same value for identical objects). Regardless of how good your function is, you will get collisions anyway, so if you can avoid a huge amount of extra hashing work at the cost of a slightly more collision-prone function, it's often worth it.
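To illustrate the bounded-O(H) point, here is a sketch of a string hash that reads at most the first 32 bytes, so O(H) stays constant even for very long keys (the constants are the usual 64-bit FNV-1a parameters; the 32-byte cap is an arbitrary illustration):

    #include <stdint.h>

    uint64_t bounded_hash(const char *s)
    {
        uint64_t h = 14695981039346656037ULL;  /* FNV-1a offset basis */
        for (int i = 0; i < 32 && s[i] != '\0'; i++) {
            h ^= (uint8_t)s[i];
            h *= 1099511628211ULL;             /* FNV-1a prime */
        }
        return h;
    }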
O(C) is the worst-case complexity of comparing the keys. An unsuccessful lookup may know there are no matches because no entry in the table exists for its hash value. A successful lookup, however, must always compare the provided key with the ones in the table; otherwise the key might not actually be a match but merely a collision. In the worst case, where there are multiple collisions for one hash value, the provided key must be compared with all of the stored colliding keys, one after another, until a match is found or all comparisons fail. This gotcha is why it's only amortized constant time rather than plain constant time: as the table grows, the chances of collisions decrease, so this happens less frequently, and the average time required to search some collection of elements always tends toward a constant.
When people say a hash lookup is O(1) (or, if you like, O(1 + 0·n)) and a tree lookup is O(log n), they mean n to be the size of the collection.
You are right that it takes O(m) additional work (where m is the length of the currently examined string), but usually m has some upper bound while n tends to get large, so m can be considered constant for both implementations.
And in absolute terms, m is more influential in the tree implementation, because you compare against a key in every node you visit, whereas with a hash you compute the hash once and compare the whole key only with the values in the one bucket determined by the hash (with a good hash function and a big enough table, there should usually be only one value there).