What is the benefit of a binary search tree over a sorted array with binary search? Just with mathematical analysis I do not see a difference, so I assume there must be a difference in the low-level implementation overhead. Analysis of average case run time is shown below.
Sorted array with binary search
search: O(log(n))
insertion: O(log(n)) (we run binary search to find where to insert the element)
deletion: O(log(n)) (we run binary search to find the element to delete)
Binary search tree
search: O(log(n))
insertion: O(log(n))
deletion: O(log(n))
Binary search trees have a worst case of O(n) for operations listed above (if tree is not balanced), so this seems like it would actually be worse than sorted array with binary search.
Also, I am not assuming that we have to sort the array beforehand (which would cost O(n log n)); we would insert elements one by one into the array, just as we would do for the binary tree. The only benefit of a BST I can see is that it supports other types of traversals, like inorder, preorder, and postorder.
Your analysis is wrong: both insertion and deletion are O(n) for a sorted array, because you have to physically move the data to make space for the inserted element or to close the gap left by the deleted one.
Oh, and the worst case for a completely unbalanced binary search tree is O(n), not O(log n).
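As a rough C++ sketch of the deletion case (assuming the values are plain ints in a std::vector): finding the element is cheap, but erasing it forces every later element to shift left by one.

#include <algorithm>
#include <vector>

// Remove one occurrence of x from a sorted vector.
// Finding x is O(log n), but vector::erase has to shift every
// element after the erased position one slot to the left: O(n).
void erase_sorted(std::vector<int>& v, int x) {
    auto it = std::lower_bound(v.begin(), v.end(), x); // O(log n)
    if (it != v.end() && *it == x)
        v.erase(it);                                   // O(n) shift
}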
There's not much of a benefit in querying either one.
But constructing a sorted tree is a lot faster than constructing a sorted array, when you're adding elements one at a time. So there's no point in converting it to an array when you're done.
Note also that there are standard algorithms for maintaining balanced binary search trees. They get rid of the deficiencies in binary trees and maintain all of the other strengths. They are complicated, though, so you should learn about binary trees first.
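For instance, a minimal C++ sketch using std::set (which is typically implemented as a red-black tree, i.e. one of those balancing schemes): elements are inserted one at a time in O(log n) and can be read back in sorted order.

#include <iostream>
#include <set>

int main() {
    std::set<int> s;                  // balanced BST (typically a red-black tree)
    for (int x : {42, 7, 19, 3, 25})
        s.insert(x);                  // O(log n) per insert, stays balanced
    for (int x : s)                   // in-order traversal visits keys in sorted order
        std::cout << x << ' ';        // prints: 3 7 19 25 42
}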
Beyond that, the big-O may be the same, but the constants aren't always. With binary trees if you store the data correctly, you can get very good use of caching at multiple levels. The result is that if you are doing a lot of querying, most of your work stays inside of CPU cache which greatly speeds things up. This is particularly true if you are careful in how you structure your tree. See http://blogs.msdn.com/b/devdev/archive/2007/06/12/cache-oblivious-data-structures.aspx for an example of how clever layout of the tree can improve performance greatly. An array that you do a binary search of does not permit any such tricks to be used.
Adding to #Blindy, I would say that insertion into a sorted array costs more in the O(n) memory operation (the std::rotate()) than in the O(log n) CPU instructions for finding the position; refer to insertion sort.
#include <algorithm>
#include <vector>

std::vector<MYINTTYPE> sorted_array;
// ... ...
// insert x at the end
sorted_array.push_back(x);
// O(log n) CPU operation: find where x belongs among the
// already-sorted elements (everything except the new last one)
auto insertion_point = std::lower_bound(sorted_array.begin(),
                                        sorted_array.end() - 1, x);
// O(n) memory operation: rotate x from the back into its sorted position
std::rotate(insertion_point, sorted_array.end() - 1, sorted_array.end());
I guess the left-child right-sibling tree combines the essence of a binary tree and a sorted array.
| data structure | operation | CPU cost | Memory operation cost |
| --- | --- | --- | --- |
| sorted array | insert | O(log n) (benefits from pipelining) | O(n) memory operation; refer to insertion sort using std::rotate() |
| sorted array | search | O(log n) | benefits from inline implementation |
| sorted array | delete | O(log n) (when pipelining with the memory operation) | O(n) memory operation; refer to std::vector::erase() |
| balanced binary tree | insert | O(log n) (drawback of branch prediction affecting pipelining, plus the added cost of tree rotation) | additional cost of pointers that exhaust the cache |
| balanced binary tree | search | O(log n) | |
| balanced binary tree | delete | O(log n) (same as insert) | |
| left-child right-sibling tree (combines sorted array and binary tree) | insert | O(log n) on average | no std::rotate() needed when inserting on a left child if kept unbalanced |
| left-child right-sibling tree | search | O(log n) (worst case O(n) when unbalanced) | takes advantage of cache locality in the right-sibling search; refer to std::vector::lower_bound() |
| left-child right-sibling tree | delete | O(log n) (when hyperthreading/pipelining) | O(n) memory operation; refer to std::vector::erase() |
Background
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one? Here's an example that will clear things up:
Example
I'm generating N random numbers and want to insert them into a new array as I generate them, and I want the final array to be sorted.
Possible Solutions
Insertion Sort
My gut told me that putting each element in the correct place as it's generated would be fastest. This is accomplished by doing a binary search to find the correct point in the array to insert the new element. However, this is an insertion sort, which is known to be less efficient on large lists than other sorting algorithms.
Quicksort
Quicksort is generally thought of as the most efficient 'general' sorting algorithm, where nothing is known about the inputs to the array, and it's more efficient than insertion sort on large lists. Would it, therefore, be best to simply put the random numbers in the array in an unsorted order, and then sort them at the end with quicksort?
Other Solutions
Is there another algorithm I haven't thought of?
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one?
It boils down to the same problem for random data, due to efficiency considerations.
Given random data, it's actually more efficient to first generate the random values into an array (unsorted) - O(n) time complexity - and then sort it with your favorite O(n log n) sort algorithm, making the entire operation O(n) + O(n log n) = O(n log n) time complexity and, depending on the sort algorithm used, between O(1) and O(n) space complexity.
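A minimal sketch of that approach in C++ (the function name, the use of std::mt19937, and the value range are all illustrative choices, not part of the original question):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

std::vector<int> generate_sorted(std::size_t n) {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i)   // O(n): generate into an unsorted array
        v.push_back(dist(gen));
    std::sort(v.begin(), v.end());        // O(n log n): sort once at the end
    return v;
}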
There is no way to beat that approach by "keeping an array sorted as it's constructed" for random data, because any approach will require exactly O(n) generations/insertions of the values, and at least O(n log n) comparisons/swaps/shifts - no matter which method, from the numerous mentioned in comments on the original question, is used. Note, as per a very useful comment on my original answer, the binary insertion sort variant suggested in the original question will likely degrade to O(n^2) time complexity, making it an inferior solution to just generating an array of values first and then sorting it.
Using a balanced tree just matches the time complexity of generating an array and then sorting it - but loses in space complexity, as trees have some overhead, compared to an array, to keep track of child nodes, etc. Also of note, trees are heap-allocated and require a pointer dereference to access any child node - so even though the Big-O time complexity is equivalent to first generating an array of data and then sorting it, the real performance of the tree solution will be worse, as there's no data locality and there's the extra cost of pointer dereferences.

An additional consideration on balanced trees is that the insertion cost into something like an AVL tree is quite high - that is, the n in AVL's O(n log n) insertion does not represent the same per-operation cost as the n in an in-place sort of an array, due to the rotations of tree nodes needed to maintain balance. Just because the Big-O is the same doesn't mean the performance is the same. Even if you have an absolute need to be able to grab the data in a sorted order at some point during construction of the array, it might still be cheaper to just sort the array when you need it, unless you need it sorted at each insertion!
Note, this answer pertains to random data - it is possible, and even likely, to come up with a more efficient approach for "keeping an array sorted as it's constructed" if both the size and characteristics of the data are known, and follow some mathematical pattern, other than randomness; however, such approach would necessarily be overfit for the specific data set it relates to, rather than a general solution.
I recommend Heapsort or Mergesort.
Heapsort is a comparison-based algorithm that uses a binary heap data structure to sort elements. It divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element and moving that to the sorted region.
Mergesort is a comparison-based algorithm that focuses on how to merge together two pre-sorted arrays such that the resulting array is also sorted.
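A minimal heapsort sketch in C++, using the standard library's heap helpers (std::make_heap / std::sort_heap); this is one possible way to realize the algorithm described above, and the function name is illustrative:

#include <algorithm>
#include <vector>

void heap_sort(std::vector<int>& v) {
    std::make_heap(v.begin(), v.end());  // build a max-heap, O(n)
    std::sort_heap(v.begin(), v.end());  // repeatedly move the max to the back, O(n log n)
}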
If you want a true O(n log n) and the data sorted "as it is constructed", I would recommend using a proper tree-based data structure instead of an array. You can use data structures like a self-balancing binary search tree, e.g. an AVL tree.
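For example, a small C++ sketch using std::multiset (a self-balancing tree that allows duplicates, which random data may contain); the container is in sorted order after every insertion, at O(log n) per insert. The function name is illustrative.

#include <set>
#include <vector>

// Insert values one by one; the container is sorted after every insertion.
std::multiset<int> build_sorted(const std::vector<int>& values) {
    std::multiset<int> s;            // self-balancing BST under the hood
    for (int x : values)
        s.insert(x);                 // O(log n) each, O(n log n) total
    return s;
}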
Just to verify, am I correct that, if I have a sorted, one-dimensional array:
A nearest-neighbor search using a KD-Tree will at best be as fast as a binary search on that array? (For a normal search, of course, the same.)
The same goes for range search (getting all elements in range x..y).
The only advantage that I may have with a KD-Tree is when there is frequent insertion / deletion of the data.
This has been asked for Binary Trees and N-Dimensions in general, but I want to know this for the KD-Tree and 1 Dimensional data specifically.
A nearest-neighbor search using a KD-Tree will at best be as fast as a binary search on that array? (For a normal search, of course, the same.)
Correct. If the KD-Tree is slightly degenerated (through update operations) you will be worse off.
The same goes for range search (getting all elements in range x..y).
Correct. Once you find the smallest value greater than x with binary search you can just scan until you hit y. In a KD-tree you will have to walk through all the nodes where the keys in your range are located.
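On a sorted array that range search is just two binary searches plus a contiguous scan; a C++ sketch, assuming inclusive bounds x..y (the function name is illustrative):

#include <algorithm>
#include <utility>
#include <vector>

// Return iterators delimiting all elements of a sorted vector that lie in [x, y].
std::pair<std::vector<int>::const_iterator, std::vector<int>::const_iterator>
range_query(const std::vector<int>& v, int x, int y) {
    auto first = std::lower_bound(v.begin(), v.end(), x); // first element >= x, O(log n)
    auto last  = std::upper_bound(first, v.end(), y);     // first element >  y, O(log n)
    return {first, last};                                 // [first, last) is one contiguous scan
}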
The only advantage that I may have with a KD-Tree is when there is frequent insertion / deletion of the data.
It depends what you mean by advantage.
Insertion and deletion is faster in KD-Trees than in sorted arrays.
However, the search on the KD-tree will become slower with more insertions/deletions, since the KD-tree will degenerate (if you only use the base KD-tree without adaptations for updates). The binary search will stay at O(log n).
Not your question, but if you are operating in 1D you will most likely get the best of both worlds, which means Red-Black Trees, B+-Trees or something similar.
I understand that binary search cannot be done for an unordered array.
I also understand that the complexity of a binary search in an ordered array is O(log(n)).
Can I ask:
What is the complexity of insertion using binary search into an ordered array? A textbook I saw states that the complexity is O(n). Why isn't it O(1), since the element can be inserted directly, just as with linear search?
Since binary search can't be done on an unordered list, why is it possible to do insertion with a complexity of O(N)?
The complexity of insertion into a list depends on the data structure used:
linear array
In this case you need to move all the items from the insertion index onward by one position to make room for the newly inserted item. This has complexity O(n).
linked list
In this case you just change the prev/next pointers of the neighboring items, so this is O(1).
Now, for an ordered list, if you want to use binary search then (as you noticed) you can only use an array. Binary-search insertion of an item a0 into an ordered array a[n] means this:
find where to place a0
This is the binary-search part, so for example find the index ix such that:
a[ix-1]<=a0 AND a[ix]>a0 // for ascending order
This can be done by binary search in O(log(n)).
insert the item
So you first need to move all the items with index i >= ix up by one to make room, and then place the item:
for (int i=n;i>ix;i--) a[i]=a[i-1]; a[ix]=a0; n++;
As you can see this is O(n).
put it all together
so O(log(n) + n) = O(n), and that is why.
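Putting both steps together as one small sketch (C++-style, assuming a raw array a of current length n with spare capacity; the helper name is illustrative):

#include <algorithm>

// Insert a0 into the ascending array a[0..n-1] and return the new length.
// Assumes the underlying buffer has room for at least n+1 elements.
int insert_sorted(int a[], int n, int a0) {
    int ix = int(std::upper_bound(a, a + n, a0) - a); // O(log n): first index with a[ix] > a0
    for (int i = n; i > ix; --i)                      // O(n): shift to make room
        a[i] = a[i - 1];
    a[ix] = a0;
    return n + 1;
}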
BTW, searching a dataset that is not strictly ordered is also possible (although it is not called binary search anymore); see:
How approximation search works
Why do people use binary search trees?
Why not simply do a binary search on the array sorted from lowest to highest?
To me, the insertion/deletion cost seems to be the same, so why complicate life with processes such as max/min heapify, etc.?
Is it just because of random access required within a data structure?
The cost of insertion is not the same. If you want to insert an item in the middle of an array, you have to move all elements to the right of the inserted element by one position; the effort for that is proportional to the size of the array: O(N). With a self-balancing binary tree the complexity of insertion is much lower: O(log N).
I'm looking for a data structure (or structures) that would allow me to keep an ordered list of integers, with no duplicates, and with indexes and values in the same range.
I need four main operations to be efficient, in rough order of importance:
taking the value from a given index
finding the index of a given value
inserting a value at a given index
deleting a value at a given index
Using an array I have 1 at O(1), but 2 is O(N), and insertions and deletions are expensive (O(N) as well, I believe).
A Linked List has O(1) insertion and deletion (once you have the node), but 1 and 2 are O(N) thus negating the gains.
I tried keeping two arrays a[index]=value and b[value]=index, which turn 1 and 2 into O(1) but turn 3 and 4 into even more costly operations.
Is there a data structure better suited for this?
I would use a red-black tree to map keys to values. This gives you O(log(n)) for 1, 3, 4. It also maintains the keys in sorted order.
For 2, I would use a hash table to map values to keys, which gives you O(1) performance. It also adds O(1) overhead for keeping the hash table updated when adding and deleting keys in the red-black tree.
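A minimal C++ sketch of that combination (std::map standing in for the red-black tree, std::unordered_map for the hash table; the struct and method names are illustrative, and the index-shifting part of inserting/deleting at an index is glossed over):

#include <map>
#include <unordered_map>

// key (index) -> value in a red-black tree, value -> key in a hash table.
struct IndexedValues {
    std::map<int, int> by_index;            // O(log n) lookup/insert/erase, keys kept sorted
    std::unordered_map<int, int> by_value;  // O(1) average reverse lookup

    void insert(int index, int value) {     // keep both structures in sync
        by_index[index] = value;
        by_value[value] = index;
    }
    int value_at(int index) const { return by_index.at(index); }   // operation 1
    int index_of(int value) const { return by_value.at(value); }   // operation 2
    void erase_at(int index) {                                      // operation 4
        auto it = by_index.find(index);
        if (it != by_index.end()) {
            by_value.erase(it->second);
            by_index.erase(it);
        }
    }
};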
How about using a sorted array with binary search?
Insertion and deletion are slow, but given the fact that the data are plain integers, they can be optimized with calls to memcpy() if you are using C or C++. If you know the maximum size of the array, you can even avoid any memory allocations during the usage of the array, as you can preallocate it to the maximum size.
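A rough sketch of what that optimization could look like (C++/C-style; memmove is used rather than memcpy because the source and destination ranges overlap, and the function name is illustrative):

#include <algorithm>
#include <cstring>

// Insert x into the ascending array a[0..n-1]; the buffer is preallocated
// to hold at least n+1 ints, so no allocation happens here.
int insert_with_memmove(int a[], int n, int x) {
    int ix = int(std::lower_bound(a, a + n, x) - a);            // O(log n) search
    std::memmove(a + ix + 1, a + ix, (n - ix) * sizeof(int));   // one bulk shift, O(n)
    a[ix] = x;
    return n + 1;
}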
The "best" approach depends on how many items you need to store and how often you will need to insert/delete compared to finding. If you rarely insert or delete a sorted array with O(1) access to the values is certainly better, but if you insert and delete things frequently a binary tree can be better than the array. For a small enough n the array most likely beats the tree in any case.
If storage size is of concern, the array is better than the trees, too. Trees also need to allocate memory for every item they store and the overhead of the memory allocation can be significant as you only store small values (integers).
You may want to profile which is faster: copying the integers when you insert/delete from the sorted array, or the tree with its memory (de)allocations.
I don't know what language you're using, but if it's Java you can leverage LinkedHashMap or a similar Collection. It's got all of the benefits of a List and a Map, provides constant time for most operations, and has the memory footprint of an elephant. :)
If you're not using Java, the idea of a LinkedHashMap is probably still suitable for a usable data structure for your problem.
Use a vector for the array access.
Use a map as a search index from a value to its subscript in the vector (see the sketch below the list).
given a subscript, fetch the value from the vector: O(1)
given a key, use the map to find the subscript of the value: O(log N)
insert a value: push back on the vector, O(1) amortized, and insert the subscript into the map, O(log N)
delete a value: delete from the map, O(log N)
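A minimal C++ sketch of this layout (the struct and method names are illustrative, and deletion only removes the map entry, mirroring the list above):

#include <cstddef>
#include <map>
#include <vector>

// Values live in a vector (O(1) access by subscript);
// a std::map gives value -> subscript lookup in O(log N).
struct VectorWithIndex {
    std::vector<int> values;
    std::map<int, std::size_t> index;   // value -> position in `values`

    void insert(int v) {                // O(1) amortized push_back + O(log N) map insert
        index[v] = values.size();
        values.push_back(v);
    }
    int at(std::size_t i) const { return values[i]; }               // O(1)
    std::size_t subscript_of(int v) const { return index.at(v); }   // O(log N)
    void erase(int v) { index.erase(v); }  // O(log N); the vector slot is simply abandoned
};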
How about a TreeMap? O(log n) for the operations described.
I like balanced binary trees a lot. They are sometimes slower than hash tables or other structures, but they are much more predictable; they are generally O(log n) for all operations. I would suggest using a Red-black tree or an AVL tree.
How do you achieve 2 with RB-trees? We can make them count their children with every insert/delete operation. This doesn't make these operations take significantly longer. Then getting down the tree to find the i-th element is possible in O(log n) time. But I see no implementation of this method in Java or the STL.
If you're working in .NET, then according to the MS docs http://msdn.microsoft.com/en-us/library/f7fta44c.aspx
SortedDictionary and SortedList both have O(log n) for retrieval
SortedDictionary has O(log n) for insert and delete operations, whereas SortedList has O(n).
The two differ by memory usage and speed of insertion/removal. SortedList uses less memory than SortedDictionary. If the SortedList is populated all at once from sorted data, it's faster than SortedDictionary. So it depends on the situation as to which is really the best for you.
Also, your argument for the linked list is not really valid: it might be O(1) for the insert itself, but you have to traverse the list to find the insertion point, so overall it really isn't.