As per my understanding,
a binary heap (the data structure) is used to implement the priority queue ADT. It is a complete binary tree satisfying the heap property.
Heap property: if A is a parent node of B, then the key (the value) of node A is ordered with respect to the key of node B, with the same ordering applying across the heap.
First, it would help me remember the term "heap" if there were a reason behind naming this data structure a heap, given that we also use the term heap memory.
Dictionary meaning of heap: an untidy collection of things piled up haphazardly.
Question:
after learning the red-black tree and AVL tree data structures,
why do we need a new data structure (the binary heap)?
Does the binary heap solve a set of problems that red-black or AVL trees are not suited for?
The major difference between a binary heap and a red-black tree is the performance on certain operations.
Binary Heap
Pros
It makes an ideal priority queue, since the min/max element (depending on your implementation) always has O(1) access time, so there is no need to search for it (see the sketch after this list).
It's also very fast for insertion of new values (O(1) on average, O(log n) worst case).
Cons
Slow searches for random elements.
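For instance, a minimal sketch with Python's heapq module (a binary min-heap kept in a plain list) shows the O(1) peek and O(log n) push/pop:

    import heapq

    pq = []                    # heapq treats a plain Python list as a binary min-heap
    for x in [7, 2, 9, 4]:
        heapq.heappush(pq, x)  # O(log n) worst case per insertion

    print(pq[0])               # 2 -- the minimum always sits at index 0: O(1) peek
    print(heapq.heappop(pq))   # 2 -- removing the min re-heapifies in O(log n)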
RB Tree
Pros
Better searching and insertion performance.
Cons
Slower min/max searches.
More overhead in general.
It should be noted that RB trees can make good schedulers too, such as the Completely Fair Scheduler merged in Linux kernel 2.6.23.
Related
My goal is to find the best data structure for a priority queue (PQ) given the following requirements:
The workload: initially, we insert k elements into the PQ. Next, there are N operations, where N >> k and N is very large. In each operation we take the minimal value of the PQ and update it to a larger value.
The data structure should support multithreading. That is, finding the minimal value and updating it to a larger value can happen in parallel. Given L threads, the algorithm should run in O((N/L) log N) time.
The data structure should be cache efficient: each thread should have locality of reference and be fast.
The data structure does not need to be space efficient, although that is preferred.
Now, there are several PQs I could use, but all have problems:
A tree-based solution (AVL, red-black, or another balanced tree) or a skip list uses linked nodes, and is thus not cache efficient. The same applies to Fibonacci/pairing heaps.
The basic binary heap solution is not cache efficient.
The d-ary heap solution is not easily multithreaded.
My solution: I am thinking about using the d-ary heap, with a multithreaded algorithm where a SWAP operation is locked if its slots are already being accessed by another thread. But I am not sure it will work.
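For reference, here is a single-threaded Python sketch (names are mine) of the d-ary sift-down whose SWAPs I would have to guard; the locking variant is only an idea, not an established concurrent algorithm:

    def sift_down(heap, i, d=4):
        # Restore the min-heap property below index i in a d-ary heap.
        n = len(heap)
        while True:
            first = d * i + 1                  # index of the first of up to d children
            if first >= n:
                return                         # i is a leaf
            # pick the smallest among the up-to-d children of i
            smallest = min(range(first, min(first + d, n)), key=lambda c: heap[c])
            if heap[i] <= heap[smallest]:
                return                         # heap property already holds
            # this is the SWAP I would protect with a lock in the threaded version
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest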
Your thoughts?
I have heard statements like: considering the height of an AVL tree and the maximum number of keys an AVL tree node can contain, searching an AVL tree will be time-consuming because of disk I/O.
However, imagine that an index file contains the whole AVL tree structure, and the size of the index file is smaller than a single disk block; then we can just read the whole AVL tree in one disk I/O.
It seems like using an AVL tree does not incur extra disk I/O, so how do you explain that a B tree is better?
Databases use balanced trees (B+ trees); AVL is only a special case of these balanced trees, so there is no need for it.
we can just read the whole AVL tree in one disk I/O
Yes, it could work like that. Essentially, the whole data structure would be brought into memory. IO would no longer be a concern.
Some databases use this strategy. For example, SQL Server In-Memory OLTP ("Hekaton") does this and delivers roughly 100x the normal throughput for OLTP.
Hekaton uses two index data structures: hash tables and trees. I think the trees are called Bw-trees and are similar to B-trees.
For general purpose database workloads it is very desirable to not need everything in memory. B-trees are a great design tradeoff in those cases.
It's because B-trees usually hold a larger number of keys in a single node, which reduces the depth of the search. In record indexing, link traversal takes longer the deeper the tree is; by storing multiple keys in an array within a node, the tree is made wider rather than deeper, which improves cache locality and gives comparatively quick lookups.
I'm trying to compare the growth rates (both run-time and space) for stack and queue operations when implemented as both arrays and as linked lists. So far I've only been able to find average case run-times for queue pop()s, but nothing that comprehensively explores these two data structures and compares their run-times/space behaviors.
Specifically, I'm looking to compare push() and pop() for both queues and stacks, implemented as both arrays and linked lists (thus 2 operations x 2 structures x 2 implementations, or 8 values).
Additionally, I'd appreciate best, average and worst case values for both of these, and anything relating to the amount of space they consume.
The closest thing I've been able to find is this "mother of all cs cheat sheets" pdf that is clearly a masters- or doctoral-level cheat sheet of advanced algorithms and discrete functions.
I'm just looking for a way to determine when and where I should use an array-based implementation vs. a list-based implementation for both stacks and queues.
There are multiple different ways to implement queues and stacks with linked lists and arrays, and I'm not sure which ones you're looking for. Before analyzing any of these structures, though, let's review some important runtime considerations for the above data structures.
In a singly-linked list with just a head pointer, the cost to prepend a value is O(1) - we simply create the new element, wire its pointer to point to the old head of the list, then update the head pointer. The cost to delete the first element is also O(1), which is done by updating the head pointer to point to the element after the current head, then freeing the memory for the old head (if explicit memory management is performed). However, the constant factors in these O(1) terms may be high due to the expense of dynamic allocations. The memory overhead of the linked list is usually O(n) total extra memory due to the storage of an extra pointer in each element.
In a dynamic array, we can access any element in O(1) time. We can also append an element in amortized O(1), meaning that the total time for n insertions is O(n), though the actual time bounds on any insertion may be much worse. Typically, dynamic arrays are implemented by having most insertions take O(1) by appending into preallocated space, but having a small number of insertions run in Θ(n) time by doubling the array capacity and copying elements over. There are techniques to try to reduce this by allocating extra space and lazily copying the elements over (see this data structure, for example). Typically, the memory usage of a dynamic array is quite good - when the array is completely full, for example, there is only O(1) extra overhead - though right after the array has doubled in size there may be O(n) unused elements allocated in the array. Because allocations are infrequent and accesses are fast, dynamic arrays are usually faster than linked lists.
Now, let's think about how to implement a stack and a queue using a linked list or dynamic array. There are many ways to do this, so I will assume that you are using the following implementations:
Stack:
Linked list: As a singly-linked list with a head pointer.
Array: As a dynamic array
Queue:
Linked list: As a singly-linked list with a head and tail pointer.
Array: As a circular buffer backed by an array.
Let's consider each in turn.
Stack backed by a singly-linked list. Because a singly-linked list supports O(1) time prepend and delete-first, the cost to push or pop into a linked-list-backed stack is also O(1) worst-case. However, each new element added requires a new allocation, and allocations can be expensive compared to other operations.
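For concreteness, a minimal Python sketch of such a stack (class and method names are illustrative):

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next          # one extra pointer per element: O(n) overhead

    class LinkedStack:
        def __init__(self):
            self.head = None          # top of the stack

        def push(self, value):        # O(1) worst case, but allocates a new Node
            self.head = Node(value, self.head)

        def pop(self):                # O(1) worst case; assumes the stack is non-empty
            node = self.head
            self.head = node.next
            return node.value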
Stack backed by a dynamic array. Pushing onto the stack can be implemented by appending a new element to the dynamic array, which takes amortized O(1) time and worst-case O(n) time. Popping from the stack can be implemented by just removing the last element, which runs in worst-case O(1) (or amortized O(1) if you want to try to reclaim unused space). In other words, the most common implementation has best-case O(1) push and pop, worst-case O(n) push and O(1) pop, and amortized O(1) push and O(1) pop.
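In Python, the built-in list is itself a dynamic array, so the array-backed stack needs no extra machinery:

    stack = []
    stack.append(10)    # amortized O(1); occasionally O(n) when the array grows
    stack.append(20)
    print(stack.pop())  # 20 -- worst-case O(1): remove the last element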
Queue backed by a singly-linked list. Enqueuing into the linked list can be implemented by appending to the back of the singly-linked list, which takes worst-case time O(1). Dequeuing can be implemented by removing the first element, which also takes worst-case time O(1). This also requires a new allocation per enqueue, which may be slow.
Queue backed by a growing circular buffer. Enqueuing into the circular buffer works by inserting something at the next free position in the circular buffer. This works by growing the array if necessary, then inserting the new element. Using a similar analysis for the dynamic array, this takes best-case time O(1), worst-case time O(n), and amortized time O(1). Dequeuing from the buffer works by removing the first element of the circular buffer, which takes time O(1) in the worst case.
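A Python sketch of the growing circular buffer (names are mine; growth doubles the backing array):

    class CircularQueue:
        def __init__(self, capacity=8):
            self.buf = [None] * capacity
            self.head = 0                 # index of the front element
            self.size = 0

        def enqueue(self, value):         # amortized O(1); O(n) when growing
            if self.size == len(self.buf):
                self._grow()
            self.buf[(self.head + self.size) % len(self.buf)] = value
            self.size += 1

        def dequeue(self):                # worst-case O(1); assumes non-empty
            value = self.buf[self.head]
            self.buf[self.head] = None
            self.head = (self.head + 1) % len(self.buf)
            self.size -= 1
            return value

        def _grow(self):                  # copy into a buffer twice as large
            old, n = self.buf, len(self.buf)
            self.buf = [old[(self.head + i) % n] for i in range(n)] + [None] * n
            self.head = 0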
To summarize, all of the structures support pushing and popping n elements in O(n) time. The linked list versions have better worst-case behavior, but may have a worse overall runtime because of the number of allocations performed. The array versions are slower in the worst-case, but have better overall performance if the time per operation isn't too important.
These aren't the only ways you can implement these structures. You could have an unrolled linked list, where each linked list cell holds multiple values. This slightly increases the locality of reference of the lookups and decreases the number of allocations used. Other options (using a balanced tree keyed by index, for example) represent a different set of tradeoffs.
Sorry if I misunderstood your question, but if I didn't, then I believe this is the answer you are looking for.
With a vector, you can only efficiently add/delete elements at the end of the container.
With a deque, you can efficiently add/delete elements at the beginning/end of the container.
With a list, you can efficiently insert/delete elements anywhere in the container.
vectors/deques allow random-access iterators.
lists only allow sequential access.
How you need to use and store the data is how you determine which is most appropriate.
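As a rough analogy in Python (illustrative only, these are not the C++ containers themselves): the built-in list behaves like a vector and collections.deque like a deque; Python has no built-in equivalent of std::list.

    from collections import deque

    v = [1, 2, 3]              # like vector: efficient only at the end
    v.append(4); v.pop()       # amortized O(1) push_back, O(1) pop_back

    d = deque([1, 2, 3])       # like deque: efficient at both ends
    d.appendleft(0); d.append(4)
    d.popleft(); d.pop()       # all O(1)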
EDIT:
There is a lot more to this; my answer is very generalized. I can go into more depth if I'm on the right track with what your question is about.
Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?
It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.
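For concreteness, the classic array-of-children trie node behind those numbers looks roughly like this (a sketch; names are mine):

    class TrieNode:
        def __init__(self):
            # one slot per possible byte value: 256 pointers per node,
            # which is exactly the worst-case space blowup described above
            self.children = [None] * 256
            self.terminal = False

    def insert(root, key):          # O(m) for a key of m bytes
        node = root
        for b in key:               # key is a bytes object; b is an int 0..255
            if node.children[b] is None:
                node.children[b] = TrieNode()
            node = node.children[b]
        node.terminal = True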
Did you try Patricia tries (alias crit-bit or radix tries)? I think they solve the worst-case space issue.
There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.
Dan Gusfield's book is the Bible of string algorithms.
I don't think there is a reason to be worried about the worst case, for two reasons:
You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.
If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.
That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small size and grow it as elements are inserted. Worst-case insertion would be c * m, occurring when every hash table has to be reorganized as it grows, where c is the number of possible characters / unique atomic elements.
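A dict-per-node trie in Python is essentially this suggestion, since each dict starts small and is rehashed only as it grows (a sketch; the end-of-word sentinel is my convention):

    END = "$end$"                     # sentinel key marking a stored word

    def insert(root, key):
        node = root
        for ch in key:
            # each node is a hash table that grows only as needed,
            # instead of a fixed 256-slot array
            node = node.setdefault(ch, {})
        node[END] = True

    def lookup(root, key):
        node = root
        for ch in key:
            if ch not in node:
                return False
            node = node[ch]
        return END in node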
In my experience there are three implementations that I think could meet your requirements:
HAT-trie (a combination of a trie and a hash table)
Judy array (a compressed n-ary tree)
Double-array trie
You can see the benchmark here. They are as fast as hash tables, but with lower memory requirements and better worst-case behavior.
In a b-tree you can store both keys and data in the internal and leaf nodes, but in a b+ tree you have to store the data in the leaf nodes only.
Is there any advantage of doing the above in a b+ tree?
Why not use b-trees instead of b+ trees everywhere, as intuitively they seem much faster?
I mean, why do you need to replicate the key (data) in a b+ tree?
The image below helps show the differences between B+ trees and B trees.
Advantages of B+ trees:
Because B+ trees don't have data associated with interior nodes, more keys can fit on a page of memory. Therefore, it will require fewer cache misses in order to access data that is on a leaf node.
The leaf nodes of B+ trees are linked, so doing a full scan of all objects in a tree requires just one linear pass through all the leaf nodes. A B tree, on the other hand, would require a traversal of every level in the tree. This full-tree traversal will likely involve more cache misses than the linear traversal of B+ leaves.
Advantage of B trees:
Because B trees contain data with each key, frequently accessed nodes can lie closer to the root, and therefore can be accessed more quickly.
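A sketch of the two node layouts (field names are illustrative) makes both sides concrete:

    class BTreeNode:             # B tree: every node carries data too
        def __init__(self):
            self.keys = []
            self.values = []     # data (or row pointers) beside every key
            self.children = []

    class BPlusInternal:         # B+ tree: interior nodes hold only keys,
        def __init__(self):      # so more keys fit on one page of memory
            self.keys = []
            self.children = []

    class BPlusLeaf:             # B+ tree: all data lives in the leaves,
        def __init__(self):      # which are chained for linear full scans
            self.keys = []
            self.values = []
            self.next = None     # link to the next leaf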
The principal advantage of B+ trees over B trees is they allow you to pack in more pointers to other nodes by removing pointers to data, thus increasing the fanout and potentially decreasing the depth of the tree.
The disadvantage is that there are no early outs when you might have found a match in an internal node. But since both data structures have huge fanouts, the vast majority of your matches will be on leaf nodes anyway, making on average the B+ tree more efficient.
B+ trees are much easier and higher-performing for a full scan, i.e., looking at every piece of data that the tree indexes, since the terminal nodes form a linked list. To do a full scan with a B-tree you need to do a full tree traversal to find all the data.
B-trees, on the other hand, can be faster when you do a seek (looking for a specific piece of data by key), especially when the tree resides in RAM or other non-block storage. Since you can elevate commonly used nodes in the tree, fewer comparisons are required to get to the data.
In a B tree, search keys and data are stored in both internal and leaf nodes. In a B+ tree, data is stored only in leaf nodes.
A full scan of a B+ tree is very easy because all data are found in the leaf nodes. A full scan of a B tree requires a full traversal.
In a B tree, data may be found in leaf nodes or internal nodes, and deletion of internal nodes is very complicated. In a B+ tree, data is only found in leaf nodes, and deletion of leaf nodes is easy.
Insertion in a B tree is more complicated than in a B+ tree.
B+ trees store redundant search keys, but B trees have no redundant values.
In a B+ tree, leaf-node data is ordered as a sequential linked list, but in a B tree the leaf nodes cannot be stored using a linked list. Many database systems' implementations prefer the structural simplicity of a B+ tree.
Example from Database System Concepts, 5th ed.: a B+-tree and the corresponding B-tree (figures not reproduced here).
Adegoke A, Amit: I guess one crucial point you are both missing is the difference between data and pointers, as explained in this section.
Pointer: a pointer to other nodes.
Data: in the context of database indexes, the data is just another pointer to the real data (the row), which resides somewhere else.
Hence, in the case of a B tree, each node holds three kinds of information: keys, pointers to the data associated with the keys, and pointers to child nodes.
In a B+ tree, internal nodes keep keys and pointers to child nodes, while leaf nodes keep keys and pointers to the associated data. This allows more keys for a given node size; node size is determined mainly by the block size.
The advantage of having more keys per node is explained well above, so I will save my typing effort.
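A back-of-the-envelope calculation, under assumed sizes (4 KB block, 8-byte keys, 8-byte pointers), shows the effect on fanout and depth:

    import math

    BLOCK = 4096        # assumed block size in bytes
    KEY = PTR = 8       # assumed key and pointer sizes

    # B+ internal node: keys and child pointers only
    bplus_fanout = BLOCK // (KEY + PTR)        # about 256 children per node
    # B tree node: keys, data pointers, and child pointers
    btree_fanout = BLOCK // (KEY + 2 * PTR)    # about 170 children per node

    n = 10**9           # e.g., a billion indexed rows
    print(math.ceil(math.log(n, bplus_fanout)))  # 4 levels
    print(math.ceil(math.log(n, btree_fanout)))  # 5 levels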
B+ trees are especially good on block-based storage (e.g., a hard disk). With this in mind, you get several advantages, for example (off the top of my head):
High fanout / low depth: you have to fetch fewer blocks to get to the data. With data intermingled with the pointers, each read brings in fewer pointers, so you need more seeks to reach the data.
Simple and consistent block storage: an inner node has N pointers and nothing else; a leaf node has data and nothing else. That makes it easy to parse, debug, and even reconstruct.
High key density means the top nodes are almost certainly in cache; in many cases all the inner nodes quickly get cached, so only the data access has to go to disk.
Define "much faster". Asymptotically they're about the same. The differences lie in how they make use of secondary storage. The Wikipedia articles on B-trees and B+trees look pretty trustworthy.
In a B+ tree, since only keys and pointers are stored in the internal nodes, these become significantly smaller than the internal nodes of a B tree (which store both data and keys).
Hence, the index levels of a B+ tree can be fetched from external storage in a single disk read and processed to find the location of the target. Had it been a B tree, a disk read would be required for each and every decision step. Hope I made my point clear! :)
"The major drawback of the B-tree is the difficulty of traversing the keys sequentially. The B+ tree retains the rapid random access property of the B-tree while also allowing rapid sequential access."
ref: Data Structures Using C, Author: Aaron M. Tenenbaum
http://books.google.co.in/books?id=X0Cd1Pr2W0gC&pg=PA456&lpg=PA456&dq=drawback+of+B-Tree+is+the+difficulty+of+Traversing+the+keys+sequentially&source=bl&ots=pGcPQSEJMS&sig=F9MY7zEXYAMVKl_Sg4W-0LTRor8&hl=en&sa=X&ei=nD5AUbeeH4zwrQe12oCYAQ&ved=0CDsQ6AEwAg#v=onepage&q=drawback%20of%20B-Tree%20is%20the%20difficulty%20of%20Traversing%20the%20keys%20sequentially&f=false
The primary distinction between a B-tree and a B+-tree is that the B-tree eliminates the redundant storage of search-key values. Since search keys are not repeated in the B-tree, we may be able to store the index using fewer tree nodes than in the corresponding B+-tree index. However, since a search key that appears in a non-leaf node appears nowhere else in the B-tree, we are forced to include an additional pointer field for each search key in a non-leaf node.
There are space advantages for the B-tree, as repetition does not occur, and it can be used for large indices.
Take one example: you have a table with huge data per row, so every instance of the object is big.
If you use a B tree here, then most of the time is spent scanning the pages that contain data, which is of no use. In databases, that is the reason for using B+ trees: to avoid scanning object data.
B+ trees separate keys from data.
But if your data size is small, then you can store the data with the keys, which is what a B tree does.
A B+ tree is a balanced tree in which every path from the root of the tree to a leaf is of the same length, and each non-leaf node of the tree has between ⌈n/2⌉ and n children, where n is fixed for a particular tree. It contains index pages and data pages.
Binary trees have only two children per parent node; B+ trees can have a variable number of children for each parent node.
One possible use of B+ trees is that they are suitable for situations where the tree grows so large that it does not fit into available memory. Thus, you'd generally expect to be doing multiple I/Os.
It does often happen that a B+ tree is used even when it in fact fits into memory, and then your cache manager might keep it there permanently. But this is a special case, not the general one, and caching policy is a separate concern from B+ tree maintenance as such.
Also, in a B+ tree, the leaf pages are linked together in a linked list (or doubly-linked list), which optimizes traversals (for range searches, sorting, etc.). So the number of pointers is a function of the specific algorithm that is used.