B Tree with varying maximum keys?

I want to implement something along the lines of a B-tree to index some data using variable-length keys, where I expect each node in the tree to look something like this:
struct key_block {
    block_ptr parent;        // link back up the tree to the parent
    unsigned numkeys;        // number of keys currently used by this block
    struct {
        block_ptr child;     // points to the child immediately preceding this key
        struct {
            unsigned length; // how long this key is
            unsigned offset; // where the data for this key is
        } key;               // support for variable-length keys
        data_ptr content;    // ptr to the data indexed by this key
    } entries[];             // as many entries as will fit in a disk block
}; // the last entry is followed by one more block_ptr: the right-hand child of the last key
What I intend to do is store the actual key data in the same disk block as the node itself, positioned just after the final entry and its right-hand child pointer. The offset and length fields in each key indicate how far from the start of the current block that key's data begins and how long it runs.
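For example, reading a key back out of a block might look something like this (a sketch based on the layout above; entry_key is an illustrative helper name):
static const char *entry_key(const struct key_block *blk, unsigned i, unsigned *len_out) {
    *len_out = blk->entries[i].key.length;                  // the key's length field
    return (const char *)blk + blk->entries[i].key.offset;  // offset is block-relative
}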
However, I want to use a fixed disk-block size for my storage, and since I want to store the variable-length keys inside the same block as the node, the maximum number of keys that can fit in a node depends on the length of the keys in that node. This contradicts my understanding of how a B-tree generally works, where all nodes have a fixed maximum number of entries, and I am not sure whether I can really implement this as a B-tree at all, because I would be violating that typical invariant.
So should I even be looking at using a B-tree structure? If not, what alternatives exist for very fast external searching, insertion, and deletion? In particular, a key criterion for any solution is that it must scale to very, very large numbers of entries while remaining efficient for searching, insertion, and deletion (B-trees perform adequately on this front).
If I can still use a B-tree, how are the algorithms affected when there is no longer an invariant maximum number of keys, but instead the maximum depends on the content of each individual node?

There is no fundamental issue with a variable maximum number of keys in a B-tree. However, a B-tree does depend on some minimum and maximum number of keys in each node. If you have a fixed number of keys per node, this is easy (usually N/2 to N keys). Because you allow a variable number, you will need to determine a heuristic for balancing the tree, for example splitting on a byte budget rather than a key count; see the sketch below. The better the heuristic, the better the performance.
Fortunately, the issue is merely one of performance. The shape of the B-tree has several invariants, but none of them are affected by your variable number of keys, so you will still be able to search correctly. The structure just might balance poorly if you choose a bad heuristic.
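For example, a fullness test based on a byte budget might look like this (a sketch, assuming a fixed BLOCK_SIZE and the key_block layout from the question):
#define BLOCK_SIZE 4096

static int key_fits(const struct key_block *blk, unsigned key_len) {
    size_t used = sizeof(struct key_block)                 // fixed header
                + (blk->numkeys + 1) * sizeof(blk->entries[0])
                + sizeof(block_ptr);                       // trailing right-hand child
    for (unsigned i = 0; i < blk->numkeys; i++)
        used += blk->entries[i].key.length;                // existing key bytes
    return used + key_len <= BLOCK_SIZE;
}
A split can then aim to leave roughly half of the byte budget on each side, rather than half of the key count.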

Related

Robin Hood hashing in C

I'm implementing a hash table that handles collisions with Robin Hood hashing. Previously I used chaining, and inserting almost 1 million keys was practically instantaneous. The same doesn't happen with Robin Hood hashing, which I found strange, since I had the impression it was much quicker. So what I want to ask is whether my insertion function is properly implemented. Here's the code:
typedef struct hashNode {
    char *word;
    int freq;              // not used in the insertion
    int probe;             // distance from the calculated index to the index where it was actually inserted
    struct hashNode *next; // not used in the insertion
    struct hashNode *base; // not used in the insertion
} hashNode;
typedef hashNode *hash_ptr;

hash_ptr hashTable[NUM_WORDS] = {NULL}; // NUM_WORDS = 1000000
// Number of actual entries = 994707

hash_ptr swap(hash_ptr node, int index) {
    hash_ptr temp = hashTable[index];
    hashTable[index] = node;
    return temp;
}
static void insertion(hash_ptr node, int index) {
    while (hashTable[index]) {
        if (node->probe > hashTable[index]->probe) {
            node = swap(node, index);
        }
        node->probe++;
        index++;
        if (index > NUM_WORDS) index = 0;
    }
    hashTable[index] = node;
}
To contextualize everything:
the node parameter is the new entry.
the index parameter is where the new entry will be, if it isn't occupied.
The Robin Hood algorithm is very clever, but it is just as dependent on having a good hash function as any other open-addressing technique.
As a worst case, consider the worst possible hash function:
int hash(const char* key) { return 0; }
Since this will map every item to the same slot, it is easy to see that the total number of probes is quadratic in the number of entries: the first insert succeeds on the first probe; the second insert requires two probes; the third one three probes; and so on, leading to a total of n(n+1)/2 probes for n inserts. This is true whether you use simple linear probing or Robin Hood probing.
Interestingly, this hash function might have no impact whatsoever on insertion into a chained hash table if -- and this is a very big if -- no attempt is made to verify that the inserted element is unique. (This is the case in the code you present, and it's not totally unreasonable; it is quite possible that the hash table is being built as a fixed lookup table and it is already known that the entries to be added are unique. More on this point later.)
In the chained hash implementation, the non-verifying insert function might look like this:
void insert(hashNode *node, int index) {
    node->next = hashTable[index];
    hashTable[index] = node;
}
Note that there is no good reason to use a doubly-linked list for a hash chain, even if you are planning to implement deletion. The extra link is just a waste of memory and cycles.
The fact that you can build the chained hash table in (practically) no time at all does not imply that the algorithm has built a good hash table. When it comes time to look a value up in the table, the problem will be discovered: the average number of probes to find the element will be half the number of elements in the table. The Robin Hood (or linear) open-addressed hash table has exactly the same performance, because all searches start at the beginning of the table. The fact that the open-addressed hash table was also slow to build is probably almost irrelevant compared to the cost of using the table.
We don't need a hash function quite as terrible as the "always use 0" function to produce quadratic performance. It's sufficient for the hash function to have an extremely small range of possible values (compared with the size of the hash table). If the possible values are equally likely, the chained hash will still be quadratic but the average chain length will be divided by the number of possible values. That's not the case for the linear/R.Hood probed hash, though, particularly if all the possible hash values are concentrated in a small range. Suppose, for example, the hash function is
int hash(const char *key) {
    unsigned char h = 0;
    while (*key) h += *key++;
    return h;
}
Here, the range of the hash is limited to [0, 256). If the table size is much larger than 256, this will rapidly reduce to the same situation as the constant hash function. Very soon the first 256 entries in the hash table will be filled, and every insert (or lookup) after that point will require a linear search over a linearly increasing compact range at the beginning of the table. So the performance will be indistinguishable from the performance of the table with a constant hash function.
None of this is intended to motivate the use of chained hash tables. Rather, it is pointing to the importance of using a good hash function. (Good in the sense that the result of hashing a key is uniformly distributed over the entire range of possible node positions.) Nonetheless, it is generally the case that clever open-addressing schemes are more sensitive to bad hash functions than simple chaining.
Open-addressing schemes are definitely attractive, particularly in the case of static lookup tables. They are more attractive for static lookup tables because deletion can really be a pain, so not having to implement key deletion removes a huge complication. The most common solution for deletion is to replace the deleted element with a DELETED marker element. Lookup probes must still skip over the DELETED markers, but if the lookup is going to be followed by an insertion, the first DELETED marker can be remembered during the lookup scan and overwritten by the inserted node if the key is not found. That works acceptably, but the load factor has to be computed including the expected number of DELETED markers, and if the usage pattern sometimes deletes a lot of elements in succession, the usable load factor of the table will go down significantly.
In the case where deletion is not an issue, though, open-addressed hash tables have some important advantages. In particular, they have much lower overhead when the payload (the key and associated value, if any) is small. In a chained hash table, every node must contain a next pointer, and the hash table index must be a vector of pointers to node chains. For a hash table whose key occupies only the space of a pointer, the overhead is 100%, which means that a linear-probed open-addressed hash table with a load factor of 50% occupies a little less space than a chained table whose index vector is fully occupied and whose nodes are allocated on demand.
Not only is the linear probed table more storage efficient, it also provides much better reference locality, which means that the CPU's RAM caches will be used to much greater advantage. With linear probing, it might be possible to do eight probes using a single cacheline (and thus only one slow memory reference), which could be almost eight times as fast as probing through a linked list of randomly allocated table entries. (In practice, the speed up won't be this extreme, but it could well be more than twice as fast.) For string keys in cases where performance really matters, you might think about storing the length and/or the first few characters of the key in the hash entry itself, so that the pointer to the full character string is mostly only used once, to verify the successful probe.
But both the space and time benefits of open addressing are dependent on the hash table being an array of entries, not an array of pointers to entries as in your implementation. Putting the entries directly into the hash index avoids the possibly-significant overhead of a pointer per entry (or at least per chain), and permits the efficient use of memory caches. So that's something you might want to think about in your final implementation.
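For illustration, a minimal sketch of such a layout (hashEntry and entryTable are illustrative names; a NULL word marks an empty slot):
typedef struct {
    unsigned hash;   // cached hash value (see the caching note below)
    char *word;      // NULL marks an empty slot
    int freq;
} hashEntry;

static hashEntry entryTable[NUM_WORDS];  // entries live in the array itself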
Finally, it's not necessarily the case that open addressing makes deletion complicated. In a cuckoo hash (and the various algorithms which it has inspired in recent years), deletion is no more difficult than deletion in a chained hash, and possibly even easier. In a cuckoo hash, any given key can only be in one of two places in the table (or, in some variants, one of k places for some small constant k) and a lookup operation only needs to examine those two places. (Insertion can take longer, but it is still expected O(1) for load factors less than 50%.) So you can delete an entry simply by removing it from where it is; that will have no noticeable effect on lookup/insertion speed, and the slot will be transparently reused without any further intervention being necessary. (On the down side, the two possible locations for a node are not adjacent and they are likely to be on separate cache lines. But there are only two of them for a given lookup. Some variations have better locality of reference.)
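A sketch of that two-place lookup, reusing the question's hashTable and assuming two hypothetical independent hash functions h1 and h2:
#include <string.h>

hash_ptr cuckoo_find(const char *key) {
    size_t i = h1(key) % NUM_WORDS;
    if (hashTable[i] && strcmp(hashTable[i]->word, key) == 0)
        return hashTable[i];
    i = h2(key) % NUM_WORDS;
    if (hashTable[i] && strcmp(hashTable[i]->word, key) == 0)
        return hashTable[i];
    return NULL;  // the key cannot be anywhere else
}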
A few last comments on your Robin Hood implementation:
I'm not totally convinced that a 99.5% load factor is reasonable. Maybe it's OK, but the difference between 99% and 99.5% is so tiny that there is no obvious reason to tempt fate. Also, the rather slow remainder operation during the hash computation could be eliminated by making the size of the table a power of two (in this case 1,048,576) and computing the remainder with a bit mask. The end result might well be noticeably faster.
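For instance, with a power-of-two size the remainder becomes a mask (a sketch; TABLE_SIZE and home_slot are illustrative names):
#define TABLE_SIZE (1u << 20)   /* 1,048,576 slots */
#define TABLE_MASK (TABLE_SIZE - 1u)

static inline unsigned home_slot(unsigned hash) {
    return hash & TABLE_MASK;   /* same result as hash % TABLE_SIZE */
}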
Caching the probe count in the hash entry does work (in spite of my earlier doubts) but I still believe that the suggested approach of caching the hash value instead is superior. You can easily compute the probe distance; it's the difference between the current index in the search loop and the index computed from the cached hash value (or the cached starting index location itself, if that's what you choose to cache). That computation does not require any modification to the hash table entry, so it's cache friendlier and slightly faster, and it doesn't require any more space. (But either way, there is a storage overhead, which also reduces cache friendliness.)
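For example (a sketch, assuming each entry caches its full hash value):
static unsigned probe_distance(unsigned current, unsigned cached_hash) {
    unsigned home = cached_hash % NUM_WORDS;               // where the probe started
    return (current + NUM_WORDS - home) % NUM_WORDS;       // wrap-safe distance
}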
Finally, as noted in a comment, you have an off-by-one error in your wraparound code; it should be
if(index >= NUM_WORDS) index = 0;
With the strict greater-than test as written, your next iteration will try to use the entry at index NUM_WORDS, which is out of bounds.
Just to leave it here: the 99% fill rate is not reasonable. Neither is 95%, nor 90%. I know they said it in the paper; they are wrong. Very wrong. Use 60%-80%, as you should with open addressing.
Robin Hood hashing does not change the number of collisions during insertion; the average (and total) number of collisions remains the same. Only their distribution changes: Robin Hood improves the worst cases, but the averages are the same as for linear, quadratic, or double hashing.
at 75% you get about 1.5 collisions before a hit
at 80% about 2 collisions
at 90% about 4.5 collisions
at 95% about 9 collisions
at 99% about 30 collisions
I tested on a 10,000-element random table. Robin Hood hashing cannot change this average, but it improves the 1% worst-case number of collisions from 150-250 misses (at a 95% fill rate) to about 30-40.

How to serialize a Graph-like AVL tree to disk?

I know it sounds weird, but this is it. I have a data structure which is basically a modified AVL tree. Each node of the structure has a left child and a right child. These core pointers (left & right) link all the data nodes together and keep the data structure balanced (AVL rotations) to improve searching. But those are not the only pointers in the structure; there are others that can point to any random node in the tree (which creates the graph-like analogy).
The tree is built at runtime through user interaction (CLI). The user is also responsible for creating all the different links between the nodes.
An example of such a data structure could be (I haven't started coding yet; this is only a prototype):
struct node {
    struct node *left;
    struct node *right;
    struct node *links[NUM]; // may point to any random node in the tree
    /* probably many other fields here, either pointers or other data types */
};
Now, everything is in RAM. Once the user wants to exit, all the data nodes (the whole tree) should be saved to a file in binary mode (for later reloading, so one must take this into consideration).
It is basically easy to save the AVL tree using one of the recursive tree traversal algorithms (in which case this question would be a duplicate, because solutions already exist on SO). But in my case, I have to preserve all the arbitrarily created links between the nodes.
What would be the most efficient way in time & space?
You could dump your data structure as is (including the pointer values) and, in the binary blob of each node, also store the node's own address. When reloading, you dynamically allocate your nodes and store their new addresses in a hash table whose access keys are the old addresses. In a final pass, you walk the hash table sequentially, retrieve the new address of each node, and update its pointer fields from old to new addresses, again using the hash table as a translation table (with the old addresses as access keys).
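A minimal sketch of that reload, assuming hypothetical map_put()/map_get() helpers keyed by the old addresses (with map_get(t, NULL) returning NULL):
#include <stdio.h>
#include <stdlib.h>

struct saved {
    void *old_addr;    /* the node's address when it was dumped */
    struct node body;  /* pointer fields still hold old addresses */
};

struct node *load_one(FILE *f, map_t *translate) {
    struct saved s;
    if (fread(&s, sizeof s, 1, f) != 1) return NULL;
    struct node *n = malloc(sizeof *n);
    *n = s.body;
    map_put(translate, s.old_addr, n);  /* old address -> new address */
    return n;
}

/* Final pass, run once per loaded node: */
void fixup(struct node *n, map_t *translate) {
    n->left  = map_get(translate, n->left);
    n->right = map_get(translate, n->right);
    for (int i = 0; i < NUM; i++)
        n->links[i] = map_get(translate, n->links[i]);
}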
Choose a unique index number for each node, and use it to serialize the links.
This will likely take two traversal passes -- one to set the index number, and one to do the serialization. Add an integer field to your node to hold the index number; you shouldn't need any other memory overhead.
Alternatively, if you manage your tree nodes by storing them in an array or std::vector, you will already have an index number handy and won't need an additional index field. You can also store all your links as indices instead of pointers, so you can serialize your container as-is.
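A sketch of the two passes, assuming an int id field has been added to struct node as suggested (the file format here is invented for illustration):
#include <stdio.h>

static int next_id = 0;

void assign_ids(struct node *n) {            /* pass 1: number the nodes */
    if (!n) return;
    n->id = next_id++;
    assign_ids(n->left);
    assign_ids(n->right);
}

void write_links(struct node *n, FILE *f) {  /* pass 2: links as indices */
    if (!n) return;
    int l = n->left  ? n->left->id  : -1;    /* -1 encodes a null link */
    int r = n->right ? n->right->id : -1;
    fwrite(&n->id, sizeof n->id, 1, f);
    fwrite(&l, sizeof l, 1, f);
    fwrite(&r, sizeof r, 1, f);
    for (int i = 0; i < NUM; i++) {
        int link = n->links[i] ? n->links[i]->id : -1;
        fwrite(&link, sizeof link, 1, f);
    }
    write_links(n->left, f);
    write_links(n->right, f);
}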

Can I represent a red black tree as an array?

Is it worth representing a red-black tree as an array to eliminate the memory overhead? Or will the array take up more memory, since it will have empty slots?
It will have both positive and negative sides. This answer applies to C [since you mentioned that is what you will use].
Positive sides
Let's assume you have created an array as a pool of objects to use for the red-black tree. Deleting an element, or initializing a new one once its position is found, will be a little faster, because you are using a memory pool that you manage yourself.
Negative sides
Yes, the array will most probably end up taking more memory, since it will sometimes have empty slots.
You have to know the maximum size of the red-black tree in advance, so there is a size limitation.
The whole sequential block of memory is reserved whether or not it is fully used, so the unused slots can be a waste of resources.
Yes, you can represent a red-black tree as an array, but it's not worth it.
The maximum height of a red-black tree is 2*log2(n+1), where n is the number of entries, and a heap-style array layout (children of slot i at slots 2i+1 and 2i+2) needs 2^h slots to hold a tree of height h. So to store 1,000 entries you'd have to allocate an array of 1,048,576 slots, and to store 1,000,000 entries you'd have to allocate an array of 1,099,511,627,776 slots.
It's not worth it.
A red-black tree (and most data structures, really) doesn't care which storage facility is used. That means you can use an array, or even a hash table/map, to store the tree nodes; the array index or map key becomes your new "pointer". You could even put the tree on disk as a file and use file offsets as node indices if you wanted to (though in that case you should use a B-tree instead).
The main problem is increased complexity, as you now have to manage the storage manually (as opposed to letting the OS and/or language runtime do it for you). Sometimes you want to grow the array so you can store more items; sometimes you want to shrink it (vacuum) to free unused space. These operations can be costly on their own.
Memory-wise, the storage facility does not change how many nodes are in the tree. If you have 2,000 nodes in your old-school pointer-based tree (tree height = 10), you'll still have 2,000 nodes in your fancy array-based tree (tree height still 10). However, redundant space may exist in between vacuum operations.
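A sketch of what an index-linked node might look like in C (names and the sentinel convention are illustrative):
#define MAX_NODES 4096
#define NIL 0                  /* slot 0 reserved as the shared NIL sentinel */

struct rb_node {
    int key;
    unsigned left, right;      /* indices into pool[], not pointers */
    unsigned char color;       /* 0 = black, 1 = red */
};

static struct rb_node pool[MAX_NODES];

/* The pointer walk  x = x->left  becomes:  x = pool[x].left;  */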

binary search tree index implementation using symbol tables

I am reading about index implementation using symbol tables in the book Algorithms in C++ by Robert Sedgewick.
Below is snippet from the book
We can adapt binary search trees to build indices in precisely the
same manner as we provided indirection for sorting and for heaps.
Arrange for keys to be extracted from items via the key member
function, as usual. Moreover, we can use parallel arrays for the
links, as we did for linked lists. We use three arrays, one each for
the items, left links, and right links. The links are array indices
(integers), and we replace link references such as
x = x->l
in all our code with array references such as
x = l[x].
This approach avoids the cost of dynamic memory allocation for each
node—the items occupy an array without regard to the search function,
and we preallocate two integers per item to hold the tree links,
recognizing that we will need at least this amount of space when all
the items are in the search structure. The space for the links is not
always in use, but it is there for use by the search routine without
any time overhead for allocation. Another important feature of this
approach is that it allows extra arrays (extra information associated
with each node) to be added without the tree-manipulation code being
changed at all. When the search routine returns the index for an item,
it gives a way to access immediately all the information associated
with that item, by using the index to access an appropriate array.
This way of implementing BSTs to aid in searching large arrays of
items is sometimes useful, because it avoids the extra expense of
copying items into the internal representation of the ADT, and the
overhead of allocation and construction by new. The use of arrays is
not appropriate when space is at a premium and the symbol table grows
and shrinks markedly, particularly if it is difficult to estimate the
maximum size of the symbol table in advance. If no accurate size
prediction is possible, unused links might waste space in the item
array.
My questions on above text are
What does the author mean by "we can use parallel arrays for the links as we did for linked lists"? What does this statement mean, and what are parallel arrays?
What does the author mean by "links are array indices", and by replacing link references such as x = x->l with array references such as x = l[x]?
What does the author mean by "Another important feature of this approach is that it allows extra arrays (extra information associated with each node) to be added without the tree-manipulation code being changed at all"?
You appear to have edited the text to take out the useful references. Either that or you have an earlier version of the text.
My third edition covers the index builds in section 9.6, and the parallel arrays are explained in chapter 3. The parallel arrays simply store the payload (the keys, and possibly data, held in the tree) and the left/right pointers in three or more separate arrays, using a common index to tie them together (x = leftptr[x]). In that case, you may end up with something like:
int leftptr[100];
int rightptr[100];
char *payload[100];
and so on. In that example, node #74 would have its data stored in payload[74], and the left and right "pointers" (actually indexes) stored in leftptr[74] and rightptr[74] respectively.
This is in contrast to having a single array of structures with the structure holding payload and pointers together (x = x->left;):
struct sNode {
    struct sNode *left, *right;
    char payload[];
};
So, for your specific questions:
Parallel arrays are simply separating the tree structure information from the payload information and using the index to tie together information from those arrays.
Since you're using arrays for the links (and these arrays now hold array indexes rather than pointers), you no longer use x = x->left to move to the left child. Instead you use x = left[x].
The tree manipulation is only interested in the links. By having the links separated from the payload (and other possibly useful information), the code for manipulating tree structure can be simpler.
If you haven't already, you should flip back in the book to the section on linked-lists where he says the technique was used previously (it's probably explained there).
Using parallel arrays means we don't have a struct to hold the node information.
struct node {
    int data;
    struct node *left;
    struct node *right;
};
Instead, we have arrays.
int data[SIZE];
int left[SIZE];
int right[SIZE];
These are parallel arrays because we will use the same index to access the data and links. The node is represented in our code by an index, not a pointer. So for node 4, the data is at
data[4];
The left link is at
left[4];
Adding more information at the node can be done by creating yet another array of the same size.
int extra[SIZE];
The extra data for node 4 will be at
extra[4];
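To tie it together, a small sketch of a BST search over these parallel arrays, assuming -1 is used as the null link (the book may use a different sentinel):
int search(int root, int key) {
    int x = root;
    while (x != -1 && data[x] != key)
        x = (key < data[x]) ? left[x] : right[x];
    return x;   /* node index, or -1 if the key is absent */
}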

how to implement the string key in B+ Tree?

Many B+ tree examples are implemented using an integer key, but I have seen other examples using both integer and string keys. I have learned the B+ tree basics, but I don't understand how a string key works.
I also use a multi-level B-tree. A string, let's say "test", can be seen as an array [t,e,s,t]. Now think about a tree of trees: each node holds one character for a certain position. You also need some key/value container for each node's contents, such as a growing linked list of arrays, a tree, or whatever; you can also make the node size dynamic (a limited number of letters).
If all keys fit in the leaf, you store them in the leaf. If the leaf gets too big, you can add new nodes.
And now, since each node knows its letter and position, you can strip those characters from the keys stored in the leaf and reconstruct them as you search, or whenever you know the leaf plus the position within the leaf.
If, after you have created the tree, you write it out in a suitable format, you end up with string compression, because each letter combination (prefix) is stored only once, even if it is shared by thousands of strings.
This simple compression often achieves 1:10 for normal text (in any language!) and 1:4 in memory, and you can still search for any given word (the strings in the dictionary you built the B+ tree for).
This is one extreme of what you can do with multiple levels.
Databases usually use some form of prefix tree (the first x characters, with the rest stored in the leaves and binary search used within the leaf). There are also implementations that use variable prefix lengths based on the actual key density. So in the end it is very implementation-specific, and a lot of options exist.
If the tree only needs to find an exact string, adding the length and hashing a few bits of each character often does the trick. For example, you could build a key out of length (8 bits) + 4 bits * 6 characters = 32 bits; that's your hash code. Or you could use the first, last, and middle characters along with it. Since the length is one of the most selective attributes, you won't find many collisions while searching for your string.
This approach is very good for finding a particular string, but it destroys the natural order of the strings, leaving you no way to answer range queries and the like. But for cases where you search for a particular username, email, or address, such a tree would be superior (though the question then is why not use a hashmap).
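A sketch of the length + character-bits packing described above (the exact bit layout is illustrative):
#include <string.h>

unsigned string_key(const char *s) {
    unsigned len = (unsigned)strlen(s);
    unsigned k = (len & 0xFFu) << 24;                  /* 8 bits of length */
    for (unsigned i = 0; i < 6 && i < len; i++)
        k |= (unsigned)(s[i] & 0x0F) << (20 - 4 * i);  /* 4 bits per char */
    return k;
}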
The string key can be a pointer to a string (very likely).
Or the key could be sized to fit most strings. 64 bits holds an 8-byte string, and even 16-byte keys aren't too ridiculous.
Choosing a key really depends on how you plan to use it.
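For example, here is a sketch of a fixed-width key that stores the first bytes inline and keeps a pointer to the full string for resolving ties (the layout is illustrative, not from the answer):
#include <string.h>

struct bpt_key {
    char prefix[8];    /* first bytes of the string, zero-padded */
    const char *full;  /* the complete string */
};

int bpt_key_cmp(const struct bpt_key *a, const struct bpt_key *b) {
    int c = memcmp(a->prefix, b->prefix, sizeof a->prefix);
    return c ? c : strcmp(a->full, b->full);
}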
