Optimizing AVLTree with B-tree - database

PREMISE
So lately i have been thinking of a problem that is common to databases: trying to optimize insertion, search, deletion and update of data.
Usually i have seen that most of the databases nowadays use the BTree or B+Tree to solve such a problem, but they are usually used to store data inside the disk and i wanted to work with in-memory data, so i thought about using the AVLTree (the difference should be minimal because the purpose of the BTrees is kind of the same of the AVLTree but the implementation is different and so are the effects).
Before continuing with the reasoning behind this i would like to get in a deeper level of what i am trying to solve.
So in a modern database data stored in a table with a PRIMARY KEY which tends to be INDEXED (i am not very experienced in indexing so what i will say is basic reasoning i put into this problem), usually the PRIMARY KEY is an increasing number (even though nowadays is a bad practice) starting from 1.
Using normally an AVLTree should be more then enough to solve the problem cause this particular tree is always balanced and offers O(log2(n)) operations, BUT i wanted to reach this on a deeper level trying to optimize it even more then needed.
THEORY
So as the title of the question suggests i am trying to optimize the AVLTree merging it with a Btree.
Basically every node of this new Tree is lets say an array of ten elements every node as also the corresponding height in the tree and every element of the array is ordered ascending.
INSERTION
The insertion initally fills the array of the root node when the root node is full it generates the left and right children which also contains an array of 10 elements.
Whenever a new node is added the Tree autorebalances the nodes based on the first key of the vectors of the left and right child using also their height (note that this is actually how the AVLTree behaves but the AVLTree only has 2 nodes and no vector just the values).
SEARCH
Searching an element works this way: staring from the root we compare the value we are searching K with the first and last key of the array of the current node if the value is in between, we know that it surely will be in the array of the current node so we can start using a binarySearch with O(log2(n)) complexity into this array of ten elements, otherise we go on the left if the key we are searcing is smaller then the first key or we go to the right if it is bigger.
DELETION
The same of the searching but we delete the value.
UPDATE
The same of the searching but we update the value.
CONCLUSION
If i am not wrong this should have a complexity of O(log10(log2(10))) which is always logarithmic so we shouldn't care about this optimization, but in my opinion this could make the height of the tree so much smaller while providing also quick time on the search.

B tree and B+ tree are indeed used for disk storage because of the block design. But there is no reason why they could not be used also as in-memory data structure.
The advantages of a B tree include its use of arrays inside a single node. Look-up in a limited vector of maybe 10 entries can be very fast.
Your idea of a compromise between B tree and AVL would certainly work, but be aware that:
You need to perform tree rotations like in AVL in order to keep the tree balanced. In B trees you work with redistributions, merges and splits, but no rotations.
Like with AVL, the tree will not always be perfectly balanced.
You need to describe what will be done when a vector is full and a value needs to be added to it: the node will have to split, and one half will have to be reinjected as a leaf.
You need to describe what will be done when a vector gets a very low fill-factor (due to deletions). If you leave it like that, the tree could degenerate into an AVL tree where every vector only has 1 value, and then the additional vector overhead will make it less efficient than a genuine AVL tree. To keep the fill-factor of a vector above a minimum you cannot easily apply the redistribution mechanism with a sibling node, as would be done in B-trees. It would work with leaf nodes, but not with internal nodes. So this needs to be clarified...
You need to describe what will be done when a value in a vector is updated. Of course, you would insert it in its sorted position: but if it becomes the first or last value in that vector, this may violate the order with regards to left and right children, and so also there you may need to define more precisely the algorithm.
Binary search in a vector of 10 may be overkill: a simple left-to-right scan may be faster, as CPUs are optimised to read consecutive memory. This does not impact the time complexity, since we set that the vector size is limited to 10. So we are talking about doing either at most 4 comparisons (3-4 on average depending on binary search implementation) or at most 10 comparisons (5 on average).
If I am not wrong this should have a complexity of O(log10(log2(n))) which is always logarithmic
Actually, if that were true, it would be sub-logarithmic, i.e. O(loglogn). But there is a mistake here. The binary search in a vector is not related to n, but to 10. Also, this work comes in addition to finding the node with that vector. So it is not a logarithm of a logarithm, but a sum of logarithms:
O(log10n + log210) = O(log n)
Therefore the time complexity is no different than the one for AVL or B-tree -- provided that the algorithm is completed with the missing details, keeping within the logarithmic complexity.
You should maybe also consider to implement a pure B tree or B+ tree: that way you also benefit from some of the advantages that neither the AVL, nor the in-between structure has:
The leaves of the tree are all at the same level
No rotations are needed
The tree height only changes at one spot: the root.
B+ trees provide a very fast mean for iterating all values in their order.

Related

How searching a million keys organized as B-tree will need 114 comparisons?

Please explain how it will take 114 comparisons. The following is the screenshot taken from my book (Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press). My reasoning is that in worst case each node will have just minimum number of children (i.e. 5), so I took log base 5 of a million, and it comes to 9. So assuming at each level of the tree we search minimum number of keys (i.e. 4), it comes to somewhere like 36 comparisons, nowhere near 114.
Consider a situation in which we have to search an un-indexed and
unsorted database that contains n key values. The worst case running
time to perform this operation would be O(n). In contrast, if the data
in the database is indexed with a B tree, the same search operation
will run in O(log n). For example, searching for a single key on a set
of one million keys will at most require 1,000,000 comparisons. But if
the same data is indexed with a B tree of order 10, then only 114
comparisons will be required in the worst case.
Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press
The worst case tree has the minimum number of keys everywhere except on the path you're searching.
If the size of each internal node is in [5,10), then in the worst case, a tree with a million items will be about 10 levels deep, when most nodes have 5 keys.
The worst case path to a node, however, might have 10 keys in each node. The statement seems to assume that you'll do a linear search instead of a binary search inside each node (I would advise to do a binary search instead), so that can lead to around 10*10 = 100 comparisons.
If you carefully consider the details, the real number might very well come out to 114.
(This is not an Answer to the question asked, but a related discussion.)
Sounds like a textbook question, not a real-life question.
Counting comparisons is likely to be the best way to judge an in-memory tree, but not for a disk-based dataset.
Even so, the "average" number of comparisons (for in-memory) or disk hits (for disk-based) is likely to be the metric to compute.
(Sure, it is good to compute the maximum numbers as a useful exercise for understanding the structures.)
Perhaps the optimal "tree" for in memory searching is a Binary tree, but with 3-way fan out. And keep the tree balanced with 2 or 3 elements in each node.
For disk based searching -- think databases -- the optimal is likely to be a BTree with the size of a block being based on what is efficient to read from disk. Counting comparisons in a poor second when it comes to the overall time taken to fetch a row.

Better algorithm for searching through an array of points?

I have an array of structs, where each struct is a 2D position (pair of 32-bit values). This array is for tracking points of interest on a map.
struct Point {
int x;
int y;
};
// ...
struct Point pointsOfInterest[1024];
The problem is, those points of interest are constantly changing, meaning the entries in the array are very frequently being added or removed. On top of that, each reported point of interest can possibly already exist in the array, so I can't blindly add new ones without checking whether they already exist.
At the moment the array is unsorted (new entries added to the end, swap and pop to remove), and I iterate over the entire list to find entries for removal or duplication check. I'd like to know what my options are for speeding up this process.
In other languages, this is where I break out a dictionary or hash set. Neither exist in C, so I have to weigh the complexity of adding something like that.
I've considered sorting the list (i.e. first by X, then Y). But given the frequency of updates, I feel like I'll be thrashing the table far more than when iterating. But my knowledge of sorting algorithms is minimal.
Would a binary tree of some sort be any better here? Or would I again be spending all of my time re-balancing the tree?
Theoretically, given the (perceived) complexity of these algorithms, is there a threshold below which a linear search remains a viable option?
I'm assuming this is a known solved problem, so I'm hoping to get pointed in the right direction before I spend a lot of time reinventing the wheel and testing possible solutions.
Apart from trivial cases, it's often extremely hard to predict where the performance gains are. That's why you should benchmark your code before and after changes. Also profile your code to find where it spends most time.
In other languages, this is where I break out a dictionary or hash set. Neither exist in C, so I have to weigh the complexity of adding something like that.
TBH, it's not that complicated to implement. If you need the performance, it's a no brainer. But it's not guaranteed that it will be faster.
I've considered sorting the list (i.e. first by X, then Y). But given the frequency of updates, I feel like I'll be thrashing the table far more than when iterating. But my knowledge of sorting algorithms is minimal.
It's very likely that this is not optimal. But you can try it out. And you don't need to do a complete sort. Just do a binary search and move everything that comes after.
Would a binary tree of some sort be any better here? Or would I again be spending all of my time re-balancing the tree?
Only one way to find out. Try it and benchmark.
Theoretically, given the (perceived) complexity of these algorithms, is there a threshold below which a linear search remains a viable option?
I'm sure there are, but these always have to be balanced with reality. Like cache misses that can have a great impact on performance. One thing that might improve cache-friendlyness could be changing
struct Point {
int x;
int y;
};
struct Point pointsOfInterest[1024];
to
int pointsOfInterest[2][1024];
And use the first index for x or y. Might work, depending on what you're doing with the data. I guess it would not work in your case, but it could speed up a function that's only loops over one dimension.

BST of minimal height

I'm trying to solve the following problem: "Given a sorted (increasing order) array with unique integer elements, write an algorithm to create a BST with minimal height."
The given answer takes the root node to be the middle of the array. While doing this makes sense to me intuitively, I'm trying to prove, rigorously, that it's always best to make the root node the middle of the array.
The justification given in the book is: "To create a tree of minimal height, we need to match the number of nodes in the left subtree to the number of nodes in the right subtree as much as possible. This means that we want the root node to be the middle of the array, since this would mean that half the elements would be less than the root and half would be greater."
I'd like to ask:
Why would any tree of minimal height be one where the number of nodes in the left subtree be as equal as possible to the number of nodes in the right subtree? (Or, do you have any other way to prove that it's best to make the root node the middle of the array?)
Is a tree with minimal height the same as a tree that's balanced? From a previous question on SO, that's the impression I got, (Visualizing a balanced tree) but I'm confused because the book specifically states "BST with minimal height" and never "balanced BST".
Thanks.
Source: Cracking the Coding Interview
The way I like to think about it, if you balance a tree using tree rotations (zig-zig and zig-zag rotations) you will eventually reach a state in which the left and right subtree differ by at most height of one. It is not always the case that a balanced tree must have the same number of children on the right and the left; however, if you have that invariant(same # of children on each side), you can reach a tree that is balanced using tree rotations)
Balance is defined arbitrarily. AVL trees define it in such as way that no subtree of the tree has a children whose heights differ by more than one. Other trees define balance in different ways, so they are not the same distinction. They are inherently related yet not exactly the same. That being said, a tree of minimal height will always be balanced under any definition since balancing exists to maintain a O(log(n)) lookup time of the BST.
If I missed anything or said anything wrong, feel free to edit/correct me.
Hope this helps
Why would any tree of minimal height be one where the number of nodes
in the left subtree be as equal as possible to the number of nodes in
the right subtree?
There can be a scenario where in minimal height tree which are ofcourse balanced can have different number of node count on left and right hand side. BST worst case traversal is O(n) in case if it is sorted and in minimal height trees the complexity for worst case is O(log n).
*
/ \
* *
/
*
Here you can clearly see that left node count and right nodes are not equal though it is a minimal height tree.
Is a tree with minimal height the same as a tree that's balanced? From a previous question on SO, that's the impression I got, (Visualizing a balanced tree) but I'm confused because the book specifically states "BST with minimal height" and never "balanced BST".
Minimal height tree is balanced one, for more details you can take a look on AVL trees which are also known as height-balanced trees. While making BST a height balanced tree you have to perform rotations (LR, RR, LL, RL).

What is the difference between modified preorder and just preorder?

I've read about tree traversals and tree data structures so i now know what a preorder tree traversal is, but i also see that there is something called a modified preorder tree traversal but im finding it a little difficult to find good answers or documentation about what the difference between these two are. Can someone comment on that ? I did find an article about explaining it but the diagram looked similar to the regular preorder, the only thing that author wrote was that a node had two additional values, and im not sure if that is correct or not.
The article I read: http://imrannazar.com/Modified-Preorder-Tree-Traversal
I also looked through this: http://www.sitepoint.com/hierarchical-data-database-2/ but I'm having a hard time trusting the author who says he is an economics major in college writing about tree structures.
Django's mptt module is one place its being used: http://django-mptt.github.io/django-mptt/overview.html#what-is-modified-preorder-tree-traversal
It seems like the modified version of this preorder tree traversal is being used several places when I search on google so I find it a little weird that there aren't more articles that explain the differences.
I disagree slightly with the naming, and perhaps there's some confusion from your side regarding this.
I'd define "traversal" as the process of traversing.
The result we get from the traversal I'd perhaps call a "representation".
MPTT is more of a representation than a traversal.
Additionally, the value on the left can be thought of as preorder, while the value on the right can be thought of as postorder (we'll get to that...).
Thus I might have rather named it "Combined Pre-/postorder traversal representation".
Now onto what this actually is.
As mentioned above, it's merely a representation.
Let's look at an algorithm to generate this representation from a tree:
traverse(node, index)
node.left = index
for each child c of node
traverse(c, ++index)
node.right = ++index
As can be seen with the above code, we do something with the node both before and after recursing to the children, so it can be seen as some combination of pre- and postorder traversal.
Now where the more significant difference between this and pre- or postorder comes in, is how it's used.
Pre- or postorder is typically something you run on a tree or use to generate a tree from (perhaps to compactly store it to disk, while a typical pointer-based representation might be generated in memory when you actually want to use the tree).
But with MPTT, this would be how you actually represent the tree while using it. This is especially useful in databases, which aren't particularly focussed around recursion.
You'd store the MPTT values in your table, and these would allow you to efficiently perform various queries. For example: (derived from here)
To get all descendants of a node, you'd look for left values between the left and right values of that node.
To get the path to / all ancestors of a node, you'd look for nodes with left values smaller than this left value and right values greater than this right value.
Both of the above can be performed using a single query, where-as a recursive representation will require one query for each node in the path and ... a few for the descendants.

Best and easiest algorithm to search for a vertex on a Graph?

After implementing most of the common and needed functions for my Graph implementation, I realized that a couple of functions (remove vertex, search vertex and get vertex) don't have the "best" implementation.
I'm using adjacency lists with linked lists for my Graph implementation and I was searching one vertex after the other until it finds the one I want. Like I said, I realized I was not using the "best" implementation. I can have 10000 vertices and need to search for the last one, but that vertex could have a link to the first one, which would speed up things considerably. But that's just an hypothetical case, it may or may not happen.
So, what algorithm do you recommend for search lookup? Our teachers talked about Breadth-first and Depth-first mostly (and Dikjstra' algorithm, but that's a completely different subject). Between those two, which one do you recommend?
It would be perfect if I could implement both but I don't have time for that, I need to pick up one and implement it has the first phase deadline is approaching...
My guess, is to go with Depth-first, seems easier to implement and looking at the way they work, it seems a best bet. But that really depends on the input.
But what do you guys suggest?
If you’ve got an adjacency list, searching for a vertex simply means traversing that list. You could perhaps even order the list to decrease the needed lookup operations.
A graph traversal (such as DFS or BFS) won’t improve this from a performance point of view.
Finding and deleting nodes in a graph is a "search" problem not a graph problem, so to make it better than O(n) = linear search, BFS, DFS, you need to store your nodes in a different data structure optimized for searching or sort them. This gives you O(log n) for find and delete operations. Candidatas are tree structures like b-trees or hash tables. If you want to code the stuff yourself I would go for a hash table which normally gives very good performance and is reasonably easy to implement.
I think BFS would usually be faster an average. Read the wiki pages for DFS and BFS.
The reason I say BFS is faster is because it has the property of reaching nodes in order of their distance from your starting node. So if your graph has N nodes and you want to search for node N and node 1, which is the node you start your search form, is linked to N, then you will find it immediately. DFS might expand the whole graph before this happens however. DFS will only be faster if you get lucky, while BFS will be faster if the nodes you search for are close to your starting node. In short, they both depend on the input, but I would choose BFS.
DFS is also harder to code without recursion, which makes BFS a bit faster in practice, since it is an iterative algorithm.
If you can normalize your nodes (number them from 1 to 10 000 and access them by number), then you can easily keep Exists[i] = true if node i is in the graph and false otherwise, giving you O(1) lookup time. Otherwise, consider using a hash table if normalization is not possible or you don't want to do it.
Depth-first search is best because
It uses much less memory
Easier to implement
the depth first and breadth first algorithms are almost identical, except for the use of a stack in one (DFS), a queue in the other (BFS), and a few required member variables. Implementing them both shouldn't take you much extra time.
Additionally if you have an adjacency list of the vertices then your look up with be O(V) anyway. So little to nothing will be gained via using one of the two other searches.
I'd comment on Konrad's post but I can't comment yet so... I'd like to second that it doesn't make a difference in performance if you implement DFS or BFS over a simple linear search through your list. Your search for a particular node in the graph doesn't depend on the structure of the graph, hence it's not necessary to confine yourself to graph algorithms. In terms of coding time, the linear search is the best choice; if you want to brush up your skills in graph algorithms, implement DFS or BFS, whichever you feel like.
If you are searching for a specific vertex and terminating when you find it, I would recommend using A*, which is a best-first search.
The idea is that you calculate the distance from the source vertex to the current vertex you are processing, and then "guess" the distance from the current vertex to the target.
You start at the source, calculate the distance (0) plus the guess (whatever that might be) and add it to a priority queue where the priority is distance + guess. At each step, you remove the element with the smallest distance + guess, do the calculation for each vertex in its adjacency list and stick those in the priority queue. Stop when you find the target vertex.
If your heuristic (your "guess") is admissible, that is, if it's always an under-estimate, then you are guaranteed to find the shortest path to your target vertex the first time you visit it. If your heuristic is not admissible, then you will have to run the algorithm to completion to find the shortest path (although it sounds like you don't care about the shortest path, just any path).
It's not really any more difficult to implement than a breadth-first search (you just have to add the heuristic, really) but it will probably yield faster results. The only hard part is figuring out your heuristic. For vertices that represent geographical locations, a common heuristic is to use an "as-the-crow-flies" (direct distance) heuristic.
Linear search is faster than BFS and DFS. But faster than linear search would be A* with the step cost set to zero. When the step cost is zero, A* will only expand the nodes that are closest to a goal node. If the step cost is zero then every node's path cost is zero and A* won't prioritize nodes with a shorter path. That's what you want since you don't need the shortest path.
A* is faster than linear search because linear search will most likely complete after O(n/2) iterations (each node has an equal chance of being a goal node) but A* prioritizes nodes that have a higher chance of being a goal node.

Resources