Why is a B+-tree preferred over a Fibonacci heap in database-system implementation?

Is there a particular reason the B+-tree is preferred over the Fibonacci heap when implementing a larger-scale database system? From the complexity analysis in the image it would seem that the Fibonacci heap is faster.

A B+ tree is a search tree and not a heap. The image is not comparing a Fibonacci heap with a B+ tree, but with a binary heap.
Comparing heap and search tree
A heap is a data structure that provides a lazy order, i.e. to get the i-th value in sorted order, you have to alter the heap as you pop values from it. This is true for both heap implementations in the image you shared.
A search tree has a stronger focus on order. You can iterate its values in order in O(n) time without making any change to the tree. For a heap that would amount to O(n log n), as you would need n extract-min operations, and the heap loses the values you extract from it.
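To make the contrast concrete, here is a small sketch using the standard containers as stand-ins (std::map for a balanced search tree, std::priority_queue for a binary heap); the example values are my own:

#include <cstdio>
#include <map>
#include <queue>
#include <vector>

int main() {
    std::vector<int> values = {5, 1, 4, 2, 3};

    // Search tree (std::map is a red-black tree): ordered iteration is O(n)
    // and leaves the structure untouched.
    std::map<int, bool> tree;
    for (int v : values) tree[v] = true;
    for (const auto& kv : tree) std::printf("%d ", kv.first);
    std::printf("\n");

    // Heap (std::priority_queue with greater<> is a min-heap): getting the
    // values in order needs n pops, O(n log n) total, and empties the heap.
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap(
        values.begin(), values.end());
    while (!heap.empty()) {
        std::printf("%d ", heap.top());
        heap.pop();
    }
    std::printf("\n");
}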
You wrote:
Is there a particular reason the B+-tree is preferred, when implementing a larger scale database-system
A heap is not a useful data structure for indexing data in database systems, as the order is not known without alteration, and the nodes, when read in ordered sequence, are scattered at different disk locations.
A search tree is a better fit for this purpose. Among search trees, those that work well with larger block sizes are interesting choices for databases that keep their data on relatively slow disks. A B-tree is such an example. B+-trees have the advantage over B-trees that they store values in order within linked leaf blocks, so they are optimised for ordered iteration, while B-trees take slightly less space than B+-trees.
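A minimal sketch of the leaf level only (the Leaf struct is a hypothetical simplification; a real B+-tree leaf maps to a disk block and the inner nodes that locate the first leaf are omitted):

#include <cstdio>
#include <vector>

// Hypothetical, simplified leaf node of a B+-tree.
struct Leaf {
    std::vector<int> keys;  // sorted keys stored in this block
    Leaf* next = nullptr;   // sibling pointer enabling ordered range scans
};

// Range scan: once the starting leaf has been found via the inner nodes,
// iteration just follows sibling pointers, reading whole blocks in order.
void scan(const Leaf* leaf) {
    for (; leaf != nullptr; leaf = leaf->next)
        for (int k : leaf->keys) std::printf("%d ", k);
    std::printf("\n");
}

int main() {
    Leaf a{{1, 3, 5}}, b{{7, 9, 11}}, c{{13, 17}};
    a.next = &b;
    b.next = &c;
    scan(&a);
}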
Comparing binary heap and Fibonacci heap
The difference in time complexity between a binary heap and a Fibonacci heap could be a factor in favour of the Fibonacci heap. But as a Fibonacci heap has a larger overhead, the gain would only appear for larger data sets. Wikipedia says:
Although Fibonacci heaps look very efficient, they have the following two drawbacks (as mentioned in the paper "The Pairing Heap: A new form of Self Adjusting Heap"): "They are complicated when it comes to coding them. Also they are not as efficient in practice when compared with the theoretically less efficient forms of heaps, since in their simplest version they require storage and manipulation of four pointers per node, compared to the two or three pointers per node needed for other structures." These other structures are referred to Binary heap, Binomial heap, Pairing Heap, Brodal Heap and Rank Pairing Heap.
Although the total running time of a sequence of operations starting with an empty structure is bounded by the bounds given above, some (very few) operations in the sequence can take very long to complete (in particular delete and delete minimum have linear running time in the worst case). For this reason Fibonacci heaps and other amortized data structures may not be appropriate for real-time systems.

Related

All purpose of binary heap

Definition:
A priority queue is an abstract data type which is like a regular queue or stack data structure, but where additionally each element has a "priority" associated with it. In a priority queue, an element with high priority is served before an element with low priority. If two elements have the same priority, they are served according to their order in the queue.
Implementation:
To implement a priority queue, there are three common implementation strategies: an unsorted array, a sorted array, and a binary heap.
To be specific, the binary heap strategy can be represented either as a plain array of keys or as an explicit tree in which each key is a binary node with two children.
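As a small illustration of the array representation (a sketch assuming 0-based indexing; the example heap values are my own):

#include <cstdio>

// Index arithmetic for a binary heap stored in a plain array (0-based).
// The tree shape is implicit, so no child pointers are stored at all.
static unsigned parent(unsigned i)      { return (i - 1) / 2; }
static unsigned left_child(unsigned i)  { return 2 * i + 1; }
static unsigned right_child(unsigned i) { return 2 * i + 2; }

int main() {
    // A max-heap laid out in an array: 9 is the root, its children are 7 and 8.
    int heap[] = {9, 7, 8, 3, 5, 6, 4};
    unsigned i = 1;  // the node holding 7
    std::printf("node %d: parent %d, children %d and %d\n",
                heap[i], heap[parent(i)], heap[left_child(i)], heap[right_child(i)]);
}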
Question:
Apart from priority queue implementation, are there any other applications of the binary heap data structure?
A binary heap can be used to extract the (max or min) element in O(log n) time. This property can be exploited in many algorithms to get a better run time.
For example, I once used one in a k-way merge sort to speed up the merge step: keeping the current head of each of the k sorted subarrays in a binary heap lets you merge all n elements in O(n log k), which beats sorting everything from scratch; a sketch follows below.
It is also used in Dijkstra's algorithm and Prim's algorithm to decrease their run time.
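A sketch of the k-way merge idea mentioned above (the function name kWayMerge and the exact interface are mine, not from the original answer):

#include <cstdio>
#include <queue>
#include <vector>

// K-way merge: keep one cursor per sorted input in a min-heap keyed by the
// current value, so each of the n elements costs O(log k), O(n log k) overall.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
    using Item = std::pair<int, std::size_t>;  // (value, index of source run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);

    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<int> out;
    while (!heap.empty()) {
        auto [value, r] = heap.top();
        heap.pop();
        out.push_back(value);
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
    return out;
}

int main() {
    for (int v : kWayMerge({{1, 4, 9}, {2, 3, 8}, {5, 6, 7}})) std::printf("%d ", v);
    std::printf("\n");
}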
Binary heaps have one other useful (and major) application: HeapSort. HeapSort has higher overhead than QuickSort, but its worst case is O(n log n) versus QuickSort's O(n²). QuickSort can be improved to an O(n log n) worst case by switching to HeapSort once the recursion gets too deep; this is called IntroSort, and is what is used in the STL and the C++ standard library. See https://en.wikipedia.org/wiki/Introsort
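As a quick illustration, HeapSort can be expressed directly with the standard heap primitives (a sketch, not the STL's internal implementation):

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v = {9, 4, 7, 1, 8, 2};

    // HeapSort in two steps: build a max-heap in O(n), then repeatedly move
    // the maximum to the back of the range.
    std::make_heap(v.begin(), v.end());  // O(n)
    std::sort_heap(v.begin(), v.end());  // O(n log n), leaves v sorted ascending

    for (int x : v) std::printf("%d ", x);
    std::printf("\n");
}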

Cache Optimization - Hashmap vs QuickSort?

Suppose that I have N unsorted arrays, of integers. I'd like to find the intersection of those arrays.
There are two good ways to approach this problem.
One, I can sort the arrays in place with an O(n log n) sort like QuickSort or MergeSort. Then I can put a pointer at the start of each array, compare the current elements of the arrays, advance the pointer of whichever array[pointer] is smallest, and if they're all equal, I've found an intersection value.
This is an O(n log n) solution with constant memory (since everything is done in place).
The second solution is to use a hash map, putting in the values that appear in the first array as keys, and then incrementing those values as you traverse through the remaining arrays (and then grabbing everything that had a value of N). This is an O(n) solution, with O(n) memory, where n is the total size of all of the arrays.
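A sketch of that hash-map approach (the intersect function and example values are mine; the per-array check in the loop makes duplicates within a single array harmless):

#include <cstdio>
#include <unordered_map>
#include <vector>

// Hash-map intersection: seed the map with the first array, then bump a
// value's counter at most once per remaining array; values whose counter
// equals the number of arrays are in the intersection.
std::vector<int> intersect(const std::vector<std::vector<int>>& arrays) {
    std::unordered_map<int, std::size_t> counts;
    for (int v : arrays[0]) counts[v] = 1;
    for (std::size_t i = 1; i < arrays.size(); ++i)
        for (int v : arrays[i]) {
            auto it = counts.find(v);
            if (it != counts.end() && it->second == i) ++it->second;
        }

    std::vector<int> result;
    for (const auto& kv : counts)
        if (kv.second == arrays.size()) result.push_back(kv.first);
    return result;
}

int main() {
    for (int v : intersect({{3, 1, 4, 2}, {4, 2, 9, 3}, {2, 4, 7}}))
        std::printf("%d ", v);
    std::printf("\n");
}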
Theoretically, the former solution is O(n log n) and the latter is O(n). However, hash maps do not have great locality, because collisions scatter items more or less randomly through the map. The other solution, although O(n log n), traverses each array sequentially, exhibiting excellent locality. Since a CPU will tend to pull the array values next to the current index into the cache, the O(n log n) solution will be hitting the cache much more often than the hash map solution.
Therefore, given a significantly large array size (as the number of elements goes to infinity), is it feasible that the O(n log n) solution is actually faster than the O(n) solution?
For integers you can use a non-comparison sort (see counting sort, radix sort). A large set might also be encoded, e.g. by turning sequential runs into ranges. That would compress the data set and allow skipping past large blocks (see RoaringBitmaps). This has the potential to be hardware friendly and still O(n).
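A minimal sketch of the bitmap idea, assuming the integers fall in a small known range (the kDomain constant and example arrays are assumptions for illustration; compressed bitmap libraries such as RoaringBitmaps apply the same idea to sparse 32-bit domains):

#include <bitset>
#include <cstdio>
#include <vector>

constexpr std::size_t kDomain = 1024;  // assumed value range [0, kDomain)

int main() {
    std::vector<std::vector<int>> arrays = {{3, 1, 4, 2}, {4, 2, 9, 3}, {2, 4, 7}};

    // Each set becomes a bitset; intersection is an AND over machine words.
    std::bitset<kDomain> acc;
    acc.set();  // start with the full domain
    for (const auto& a : arrays) {
        std::bitset<kDomain> bits;
        for (int v : a) bits.set(static_cast<std::size_t>(v));
        acc &= bits;  // word-at-a-time intersection
    }

    for (std::size_t v = 0; v < kDomain; ++v)
        if (acc.test(v)) std::printf("%zu ", v);
    std::printf("\n");
}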
Complexity theory does not account for constants. As you suspect, there is always the potential for an algorithm with a higher complexity to be faster than the alternative, due to the hidden constants. By exploiting the nature of the problem, e.g. limiting the solution to integers, there are potential optimizations not available to a general-purpose approach. Good algorithm design often requires understanding and leveraging those constraints.

A Memory-Adaptive Merge Algorithm?

Many algorithms work by using the merge algorithm to merge two different sorted arrays into a single sorted array. For example, given as input the arrays
1 3 4 5 8
and
2 6 7 9
The merge of these arrays would be the array
1 2 3 4 5 6 7 8 9
Traditionally, there seem to be two different approaches to merging sorted arrays (note that the case for merging linked lists is quite different). First, there are out-of-place merge algorithms that work by allocating a temporary buffer for storage, then storing the result of the merge in the temporary buffer. Second, if the two arrays happen to be part of the same input array, there are in-place merge algorithms that use only O(1) auxiliary storage space and rearrange the two contiguous sequences into one sorted sequence. These two classes of algorithms both run in O(n) time, but the out-of-place merge algorithm tends to have a much lower constant factor because it does not have such stringent memory requirements.
My question is whether there is a known merging algorithm that can "interpolate" between these two approaches. That is, the algorithm would use somewhere between O(1) and O(n) memory, but the more memory it has available to it, the faster it runs. For example, if we were to measure the absolute number of array reads/writes performed by the algorithm, it might have a runtime of the form n g(s) + f(s), where s is the amount of space available to it and g(s) and f(s) are functions derivable from that amount of space available. The advantage of this function is that it could try to merge together two arrays in the most efficient way possible given memory constraints - the more memory available on the system, the more memory it would use and (ideally) the better the performance it would have.
More formally, the algorithm should work as follows. Given as input an array A consisting of two adjacent, sorted ranges, rearrange the elements in the array so that the elements are completely in sorted order. The algorithm is allowed to use external space, and its performance should be worst-case O(n) in all cases, but should run progressively more quickly given a greater amount of auxiliary space to use.
Is anyone familiar with an algorithm of this sort (or know where to look to find a description of one?)
At least according to the documentation, the in-place merge function in the SGI STL is adaptive and "its run-time complexity depends on how much memory is available". The source code is available, of course, so you could at least check that one.
EDIT: STL has inplace_merge, which will adapt to the size of the temporary buffer available. If the temporary buffer is at least as big as one of the sub-arrays, it's O(N). Otherwise, it splits the merge into two sub-merges and recurses. The split takes O(log N) to find the right part of the other sub array to rotate in (binary search).
So it goes from O(N) to O(N log N) depending on how much memory you have available.
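A short usage sketch of std::inplace_merge on the example ranges from the question (a demonstration of the call, not of its internal buffer strategy):

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // One array containing two adjacent sorted ranges, as in the question.
    std::vector<int> a = {1, 3, 4, 5, 8, 2, 6, 7, 9};

    // std::inplace_merge is the adaptive merge discussed above: it uses a
    // temporary buffer when one is available and falls back to the slower
    // rotation-based strategy when memory is tight.
    std::inplace_merge(a.begin(), a.begin() + 5, a.end());

    for (int v : a) std::printf("%d ", v);
    std::printf("\n");
}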

Most Efficient way of implementing a BlackList

I am developing an IP filter and was wondering how I could, using some kind of data structure, build a VERY efficient and fast blacklist filter.
What I want to do is simple: for every incoming/outgoing connection I have to check against a list of blocked IPs.
The IPs are scattered, and memory use should be linear in the number of blocked IPs, because I want to use it on limited systems (homebrew routers).
I have time and could build anything from scratch; difficulty is not a concern for me.
If you could use anything, what would you do?
Hashtables are the way to go.
They have average O(1) complexity for lookup, insertion and deletion!
They tend to occupy more memory than trees but are much faster.
Since you are just working with 32-bit integers (you can of course convert an IP to a 32-bit integer) things will be amazingly simple and fast.
You could also just use a sorted array. Insertion and removal cost O(n), but lookup is O(log n) and, above all, the memory is just 4 bytes per IP.
The implementation is very simple, perhaps too much :D
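A minimal sketch of that sorted-array approach, assuming the IPv4 addresses have already been parsed into 32-bit integers (the SortedBlacklist class name is mine):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sorted-array blacklist: 4 bytes per blocked IPv4 address, O(log n) lookup
// via binary search.
class SortedBlacklist {
public:
    explicit SortedBlacklist(std::vector<std::uint32_t> ips) : ips_(std::move(ips)) {
        std::sort(ips_.begin(), ips_.end());
    }
    bool isBlocked(std::uint32_t ip) const {
        return std::binary_search(ips_.begin(), ips_.end(), ip);
    }

private:
    std::vector<std::uint32_t> ips_;
};

int main() {
    SortedBlacklist list({0x7F000001u /* 127.0.0.1 */, 0x0A000001u /* 10.0.0.1 */});
    std::printf("%d %d\n", list.isBlocked(0x7F000001u), list.isBlocked(0x08080808u));
}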
Binary trees have O(log n) complexity for lookup, insertion and deletion.
A simple binary tree would not be sufficient, however; you need an AVL tree or a red-black tree, which can be very annoying and complicated to implement.
AVL and red-black trees are able to balance themselves, and we need that because an unbalanced tree has a worst-case lookup complexity of O(n), which is the same as a simple linked list!
If instead of single, unique IPs you need to ban IP ranges, you probably need a Patricia trie, also called a radix tree; they were invented for word dictionaries and for IP dictionaries.
However, these trees can be slower if not well written/balanced.
Hashtables are always better for simple lookups! They are too fast to be real :)
Now about synchronization:
If you are filling the blacklist only once at application startup, you can use a plain read-only hashtable (or radix tree), which has no problems with multithreading and locking.
If you need to update it, but not very often, I would suggest using reader-writer locks.
If you need very frequent updates I would suggest you use a concurrent hashtable.
Warning: don't write your own, they are very complicated and bug prone; find an implementation on the web!
They make heavy use of the (relatively) new atomic CAS operations of modern processors (CAS means Compare And Swap). These are a special instruction, or sequence of instructions, that allows 32-bit or 64-bit fields in memory to be compared and swapped in a single atomic operation without the need for locking.
Using them can be complicated because you have to know your processor, your operating system and your compiler very well, and the algorithm itself is counterintuitive.
See http://en.wikipedia.org/wiki/Compare-and-swap for more information about CAS.
Concurrent AVL trees have been invented, but they are so complicated that I really don't know what to say about them :) for example, http://hal.inria.fr/docs/00/07/39/31/PDF/RR-2761.pdf
I just found out that concurrent radix trees exist:
ftp://82.96.64.7/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf but they are quite complicated too.
Concurrent sorted arrays don't exist, of course; you need a reader-writer lock for updates.
Consider also that the amount of memory required to handle a non-concurrent hashtable can be quite small: for each IP you need 4 bytes for the IP plus a pointer.
You also need a big array of pointers (or of 32-bit integers, with some tricks) whose size should be a prime number greater than the number of items to be stored.
Hashtables can of course also resize themselves when required, if you want, but they can also store more items than that prime number, at the cost of slower lookup times.
For both trees and hashtables, the space complexity is linear.
I hope this is a multithreaded application and not a multiprocess application (fork).
If it is not multithreaded, you cannot share a portion of memory in a fast and reliable way.
One way to improve the performance of such a system is to use a Bloom Filter. This is a probabilistic data structure, taking up very little memory, in which false positives are possible but false negatives are not.
When you want to look up an IP address, you first check the Bloom filter. If there's a miss, you can allow the traffic right away. If there's a hit, you need to check your authoritative data structure (e.g. a hash table or prefix tree).
You could also create a small cache of "hits in the Bloom Filter but actually allowed" addresses, that is checked after the Bloom Filter but before the authoritative data structure.
Basically the idea is to speed up the fast path (IP address allowed) at the expense of the slow path (IP address denied).
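A minimal sketch of the Bloom-filter-first lookup, with two simple hash functions chosen only for illustration (the filter size and hash constants are assumptions, not tuned values):

#include <bitset>
#include <cstdint>
#include <cstdio>

// Bloom filter for 32-bit IPs: a "maybe" answer must still be confirmed
// against the authoritative blacklist; a "no" answer is definitive and can
// be fast-pathed to allow the traffic.
class IpBloomFilter {
public:
    void add(std::uint32_t ip) {
        bits_.set(hash1(ip));
        bits_.set(hash2(ip));
    }
    bool maybeContains(std::uint32_t ip) const {
        return bits_.test(hash1(ip)) && bits_.test(hash2(ip));
    }

private:
    static constexpr std::size_t kBits = 1 << 16;
    // Toy hash functions for illustration only.
    static std::size_t hash1(std::uint32_t ip) { return (ip * 2654435761u) % kBits; }
    static std::size_t hash2(std::uint32_t ip) { return ((ip ^ (ip >> 16)) * 40503u) % kBits; }
    std::bitset<kBits> bits_;
};

int main() {
    IpBloomFilter filter;
    filter.add(0x7F000001u);  // 127.0.0.1
    std::printf("blocked IP, maybe=%d\n", filter.maybeContains(0x7F000001u));
    std::printf("other IP,   maybe=%d\n", filter.maybeContains(0x08080808u));
}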
The "most efficient" is a hard term to quantify. Clearly, if you had unlimited memory, you would have a bin for every IP address and could immediately index into it.
A common tradeoff is using a B-tree-type data structure. First-level bins could be preset for the first 8 bits of the IP address, each storing a pointer to, and the size of, a list containing all currently blocked IP addresses with that prefix. This second list would be padded to prevent unnecessary memmove() calls and possibly sorted. (Keeping the size and the length of the list in memory allows an in-place binary search on the list at the slight expense of insertion time.)
For example:
127.0.0.1 =insert=> { 127 :: 1 }
127.0.1.0 =insert=> { 127 :: 1, 256 }
12.0.2.30 =insert=> { 12 : 542; 127 :: 1, 256 }
The overhead of such a data structure is minimal, and the total storage size is fixed. The worst case, clearly, would be a large number of IP addresses with the same highest-order bits.
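A rough sketch of that two-level layout, using one bucket per first octet and a sorted vector of the remaining 24 bits (the class and method names are mine; the inserted values mirror the example above):

#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

// Two-level blacklist: bucket on the first octet, binary-search the rest.
class BucketedBlacklist {
public:
    void insert(std::uint32_t ip) {
        auto& bucket = buckets_[ip >> 24];
        std::uint32_t low = ip & 0x00FFFFFFu;  // remaining 24 bits
        bucket.insert(std::lower_bound(bucket.begin(), bucket.end(), low), low);
    }
    bool contains(std::uint32_t ip) const {
        const auto& bucket = buckets_[ip >> 24];
        return std::binary_search(bucket.begin(), bucket.end(), ip & 0x00FFFFFFu);
    }

private:
    std::array<std::vector<std::uint32_t>, 256> buckets_;
};

int main() {
    BucketedBlacklist list;
    list.insert(0x7F000001u);  // 127.0.0.1 -> bucket 127, value 1
    list.insert(0x7F000100u);  // 127.0.1.0 -> bucket 127, value 256
    list.insert(0x0C00021Eu);  // 12.0.2.30 -> bucket 12,  value 542
    std::printf("%d %d\n", list.contains(0x7F000001u), list.contains(0x0C000001u));
}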

Why searching in BST is faster than Binary search algorithm

I wonder why searching in BST is faster than Binary search algorithm.
I am talking about a tree that (almost) always has the same number of nodes in each subtree (i.e. it is well balanced).
I have tested both of them and searching in BST is always faster. Why?
It's impossible to know without looking at the implementation. At their core, they are the same thing.
The BST needs to follow pointers to traverse into the right half, whereas binary search on arrays does arithmetic (e.g. addition and division/shift). Usually, the binary search on arrays is a little faster because it traverses less memory overall (no pointers need to be stored) and it is more cache coherent in the final stages of the algorithm.
If the array variant is always slower for you, there's probably a glitch in the implementation or (but this is very unlikely!!) the arithmetic is a lot slower than all the memory overhead.
Both should be about the same in terms of speed. Both are O(log n). The binary search accesses a memory location and makes a comparison at every iteration. The BST follows a pointer (which is also a memory access) and makes a comparison. The difference in constants within their big-O complexity should be negligible.
One possible reason might be the fact that you need to perform an extra calculation during every iteration of the binary search. Most implementations have a line like:
mid=(high+low)/2;
The division operation can be costly compared to integer addition and comparison operations. This might be contributing to the extra performance overhead. One way to reduce the impact would be to use:
mid=(high+low)>>1;
But I think most compilers will optimize that for you anyway.
The BST variant does not need to compute anything, it just compares and follows the appropriate pointer.
Also, it might be that you are doing your binary search recursively and your BST query non-recursively, making the BST faster. But it is really hard to come up with any specific reason without looking at your code.
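Since the question hinges on implementation details, here is a minimal sketch of both iterative variants for comparison (an assumption about how such code might look, not the asker's code):

#include <cstdio>
#include <vector>

// Iterative binary search over a sorted array: pure index arithmetic over
// contiguous memory.
bool arraySearch(const std::vector<int>& a, int key) {
    std::size_t low = 0, high = a.size();
    while (low < high) {
        std::size_t mid = low + (high - low) / 2;  // avoids overflow of low + high
        if (a[mid] == key) return true;
        if (a[mid] < key) low = mid + 1; else high = mid;
    }
    return false;
}

// Iterative BST lookup: one pointer dereference per level instead of the
// index arithmetic above.
struct Node { int key; Node* left; Node* right; };
bool bstSearch(const Node* n, int key) {
    while (n != nullptr) {
        if (key == n->key) return true;
        n = key < n->key ? n->left : n->right;
    }
    return false;
}

int main() {
    std::vector<int> a = {1, 2, 3, 5, 8};
    Node l{1, nullptr, nullptr}, r{8, nullptr, nullptr}, root{3, &l, &r};
    std::printf("%d %d\n", arraySearch(a, 8), bstSearch(&root, 8));
}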

Resources