This may or may not exist, but I'm looking for a way of storing a sorted list of integers that's contiguous in memory, reasonably compact, and allows for O(log n) amortized inserts and deletes. The various self-balancing binary search trees seem to have the insertion and deletion properties I want, but are implemented with pointers all over the place, which doesn't fit my use case very well. Any ideas?
(The implementation language will almost definitely be C, if it matters. If there are existing implementations of whatever you propose, all the better, but I'm fine with writing my own.)
A binary search tree can be implemented using an array.
Given that "reads happen more than writes", you can try a dynamically resized array plus binary search on top of it. You get O(log n) access to elements (reads), but you have to pay O(n) for insert/delete: O(log n) to binary-search for the proper place in the array, plus O(n) to shift elements to the right or left. It is kind of slow, but this is one way to work it out. Try to think about it.
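A minimal sketch of the read path in C, assuming the array is kept sorted; lower_bound (the name is borrowed from C++) returns the index of the first element not less than the key, which doubles as the insertion point:

```c
#include <stddef.h>

/* Binary search over a sorted int array: returns the index of the first
   element >= key, which is also where key would be inserted. O(log n). */
size_t lower_bound(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```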
One thing you might consider is a log-structured merge tree. You can store the levels contiguously one after the other, if you keep some metadata about where each level is located. You'd want an array where you can push and pop from the end (like a C++ std::vector).
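A hedged sketch of that idea in C, in the style of a Bentley-Saxe structure (essentially an in-memory LSM tree): level k is either empty or holds exactly 2^k sorted values, and an insert merges runs upward like binary-addition carries. All names are illustrative, each level gets its own allocation here for clarity rather than one contiguous buffer, and error handling is omitted:

```c
#include <stdlib.h>

#define MAX_LEVELS 32

typedef struct {
    int *level[MAX_LEVELS];  /* NULL if the level is empty */
} lsm;

/* Merge two sorted arrays of length n into out (length 2n). */
static void merge(const int *a, const int *b, size_t n, int *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < n && j < n)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n) out[k++] = a[i++];
    while (j < n) out[k++] = b[j++];
}

/* Insert: carry merged runs upward like binary addition. */
void lsm_insert(lsm *t, int value) {
    size_t n = 1;
    int *run = malloc(sizeof(int));
    run[0] = value;
    for (int k = 0; k < MAX_LEVELS; k++, n *= 2) {
        if (t->level[k] == NULL) {       /* free slot: park the run here */
            t->level[k] = run;
            return;
        }
        int *merged = malloc(2 * n * sizeof(int));
        merge(t->level[k], run, n, merged);
        free(t->level[k]);
        free(run);
        t->level[k] = NULL;              /* carry to the next level */
        run = merged;
    }
}
```

Initialize with lsm t = {0};. Lookups binary-search each non-empty level (O(log^2 n) total), and since each element is merged at most O(log n) times, inserts are amortized O(log n).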
Background
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one? Here's an example that will clear things up:
Example
I'm generating N random numbers and want to insert them into a new array as I generate them, and I want the final array to be sorted.
Possible Solutions
Insertion Sort
My gut told me that putting each element in the correct place as it's generated would be fastest. This is accomplished by doing a binary search to find the correct point in the array to insert the new element. However, this is an insertion sort, which is known to be less efficient on large lists than other sorting algorithms.
Quicksort
Quicksort is generally thought of as the most efficient 'general' sorting algorithm, where nothing is known about the inputs to the array, and it's more efficient than insertion sort on large lists. Would it, therefore, be best to simply put the random numbers in the array in an unsorted order, and then sort them at the end with quicksort?
Other Solutions
Is there another algorithm I haven't thought of?
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one?
It boils down to the same problem for random data, due to efficiency considerations.
Given random data, it's actually more efficient to first generate the random values into an array (unsorted) - O(n) time complexity - and then sort it with your favorite O(n log n) sort algorithm, making the entire operation O(n + n log n) = O(n log n) time complexity and, depending on the sort algorithm used, between O(1) and O(n) space complexity.
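In C, that's just a fill loop followed by the standard library's qsort():

```c
#include <stdlib.h>

/* Ascending comparator for qsort(); written this way to avoid the
   overflow risk of returning a - b. */
static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

int main(void) {
    enum { N = 1000000 };
    int *a = malloc(N * sizeof *a);      /* allocation check omitted */
    for (size_t i = 0; i < N; i++)       /* O(n): generate, unsorted */
        a[i] = rand();
    qsort(a, N, sizeof *a, cmp_int);     /* O(n log n): sort once at the end */
    free(a);
    return 0;
}
```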
There is no way to beat that approach by "keeping an array sorted as it's constructed" for random data, because any approach will require exactly n generations/insertions of the values, and at least O(n log n) comparisons/swaps/shifts - no matter which method, from the numerous mentioned in comments on the original question, is used. Note, as per a very useful comment on my original answer, the binary insertion sort variant suggested in the original question will likely degrade to O(n^2) time complexity - the binary search finds each insertion point in O(log n), but each insertion still has to shift O(n) elements to make room - making it an inferior solution to just generating an array of values first and then sorting it.
Using a balanced tree just matches the time complexity of generating an array and then sorting it - but loses in space complexity, as trees carry per-node overhead (child pointers and the like) that an array does not.
Also of note, tree nodes are heap-allocated, and accessing any child node requires a pointer dereference - so even though the Big-O time complexity is equivalent to first generating an array of data and then sorting it, the real performance of the tree solution will be worse: there's no data locality, and there's the extra cost of pointer dereferences. A further consideration with balanced trees is that insertion into something like an AVL tree is quite expensive in constant factors - the rotations needed to keep the tree balanced make each of its O(log n) insertion steps costlier than the comparable steps of an in-place array sort. Just because the Big-O is the same doesn't mean the performance is the same.
Even if you have an absolute need to grab the data in sorted order at some point during construction of the array, it might still be cheaper to just sort the array when you need it - unless you need it sorted after every insertion!
Note, this answer pertains to random data - it is possible, and even likely, that a more efficient approach to "keeping an array sorted as it's constructed" exists if both the size and characteristics of the data are known and follow some mathematical pattern other than randomness; however, such an approach would necessarily be overfit to the specific data set it relates to, rather than a general solution.
I recommend heapsort or mergesort.
Heapsort is a comparison-based algorithm that uses a binary heap data structure to sort elements. It divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element and moving that to the sorted region.
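A compact heapsort sketch in C, using the usual array encoding where the children of index i sit at 2*i+1 and 2*i+2:

```c
#include <stddef.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Sift the element at index i down until the max-heap property holds
   for the region a[0..n-1]. */
static void sift_down(int *a, size_t n, size_t i) {
    for (;;) {
        size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i) return;
        swap(&a[i], &a[largest]);
        i = largest;
    }
}

void heapsort_ints(int *a, size_t n) {
    /* Build a max-heap in place: O(n). */
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, n, i);
    /* Shrink the unsorted region by moving its largest element to the end. */
    for (size_t i = n; i-- > 1; ) {
        swap(&a[0], &a[i]);
        sift_down(a, i, 0);
    }
}
```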
Mergesort is a comparison-based algorithm that focuses on how to merge together two pre-sorted arrays such that the resulting array is also sorted.
If you want true O(n log n) behavior and the data sorted "as it is constructed", I would recommend using a proper tree-based data structure instead of an array - a self-balancing binary search tree such as an AVL tree.
I want to store a small number of items (fewer than 255) which have constant size (a C char) and be able to do the following operations:
Insert a value at an arbitrary position and have the other items preserve their previous order.
Delete an item and have the other items preserve their order (as above).
Find the next and previous of an item.
I have tried using an array and writing a function that adds a value by moving all items after it one place forward. The same can be done for deleting, but it is too inefficient. Of course, I do not mind having to use a library, as long as it is readily available and free.
Array - access: O(1), insert: O(n)
Doubly linked list - access: O(n), previous/next: O(1), insert(*): O(1)
RB tree with the number of children stored: O(log n) for all operations.
(*): You need to traverse the list first to get to the position (O(n)).
Note: no, the array is not messy, it's really simple to implement. Also as you can see, depending on the usage, it can be quite efficient.
Based on the number of elements, and your remark to array implementation you should stick to arrays.
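With fewer than 255 one-byte items, the shifts touch at most a few cache lines, so a plain array with memmove() is hard to beat. A minimal sketch (the names are illustrative):

```c
#include <string.h>

#define CAP 255

typedef struct {
    char items[CAP];
    unsigned char count;
} seq;

/* Insert value at position pos, shifting later items one place right. O(n). */
int seq_insert(seq *s, unsigned char pos, char value) {
    if (s->count >= CAP || pos > s->count) return -1;
    memmove(&s->items[pos + 1], &s->items[pos], s->count - pos);
    s->items[pos] = value;
    s->count++;
    return 0;
}

/* Delete the item at position pos, shifting later items one place left. O(n). */
int seq_delete(seq *s, unsigned char pos) {
    if (pos >= s->count) return -1;
    memmove(&s->items[pos], &s->items[pos + 1], s->count - pos - 1);
    s->count--;
    return 0;
}
```

Finding the previous and next of an item is then just index arithmetic.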
You could use a doubly linked list. However, this won't work if you want to keep the array behaviour, e.g. accessing elements quickly by their index (O(1) for an array, O(n) for a linked list).
I would like to write a piece of code for inserting a number into a sorted array at the appropriate position (i.e. the array should still remain sorted after insertion)
My data structure doesn't allow duplicates.
I am planning to do something like this:
Find the right index where I should be putting this element using binary search
Create space for this element, by moving all the elements from that index down.
Put this element there.
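In C, that plan looks roughly like this - a sketch, assuming the caller guarantees capacity for one more element, and rejecting duplicates since the data structure doesn't allow them:

```c
#include <stddef.h>
#include <string.h>

/* Insert key into the sorted array a of n elements (capacity >= n + 1).
   Returns the new length, or n unchanged if key is already present. */
size_t sorted_insert(int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {                   /* 1. binary search for the index */
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    if (lo < n && a[lo] == key)         /* no duplicates allowed */
        return n;
    memmove(&a[lo + 1], &a[lo],         /* 2. create space by shifting */
            (n - lo) * sizeof *a);
    a[lo] = key;                        /* 3. put the element there */
    return n + 1;
}
```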
Is there any other better way?
If you really have an array and not a better data structure, that's optimal. If you're flexible on the implementation, take a look at AA Trees - They're rather fast and easy to implement. Obviously, takes more space than array, and it's not worth it if the number of elements is not big enough to notice the slowness of the blit as compared to pointer magic.
Does the data have to be sorted completely all the time?
If it is not, and it is only necessary to access the smallest or highest element quickly, a binary heap gives O(1) access to that element and O(log n) addition and deletion time.
Moreover, it can satisfy your condition that the memory be contiguous, since you can implement a binary heap on top of an array (i.e. array[2n+1] is the left child and array[2n+2] the right child of the node at array[n]).
A heap-based implementation of a tree would be more efficient if you are inserting a lot of elements - O(log n) for both locating/removing and inserting operations.
When implementing a heap structure, we can store the data in an array such that the children of the node at position i are at positions 2i and 2i+1.
My question is: why don't we use an array to represent binary search trees as well, instead of dealing with pointers etc.?
Thanks
Personally:
Because using pointers it's easier to grow the data structure size dynamically.
I find it's easier to maintain a binary tree than a heap.
The algorithms to balance, remove, and insert elements in the tree will alter only pointers and not move them physically as in a vector.
and so on...
If the position of all children is statically precomputed like that, then the array essentially represents a completely full, completely balanced binary tree.
Not all binary trees in "real life" are completely full and perfectly balanced. If you should happen to have a few especially long branches, you'd have to make your whole array a lot larger to accommodate all nodes at the bottom-most level.
If an array-bound binary tree is mostly empty, most of the array space is wasted.
If only some of the tree's branches are deep enough to reach to the "bottom" of the array, there's also a lot of space being wasted.
If the tree (or just one branch) needs to grow "deeper" than the size of the array will allow, this would require "growing" the array, which is usually implemented as copying to a larger array. This is a time-expensive operation.
So: Using pointers allows us to grow the structure dynamically and flexibly. Representing a tree in an array is a nice academic exercise and works well for small and simple cases but often does not fulfill the demands of "real" computing.
Mainly because the recursive tree allows for very simple code. If you flatten the tree into an array, the code becomes really complex because you have to do a lot of bookkeeping which the recursive algorithm does for you.
Also, a tree of height N can have anything between N and 2^(N+1)-1 nodes. Only the actual nodes will need memory. If you use an array, you must always allocate space for all nodes (even the empty ones), unless you use a sparse array (which would make the code even more complex). So while it is easy to keep a sparse tree of height 100 in memory, it would be problematic to find a computer which can allocate 20282409603651670423947251286008 bytes of RAM.
To insert an element into a heap, you can place it anywhere and swap it with its parent until the heap constraint is valid again. Swap-with-parent is an operation that keeps the binary tree structure of the heap intact. This means a heap of size N will be represented as an N-cell array, and you can add a new element in logarithmic time.
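In C, that insertion looks like this (a max-heap, with the children of index i at 2*i+1 and 2*i+2):

```c
#include <stddef.h>

/* Append value to the max-heap a[0..*n-1] (capacity must allow one more
   element), then swap it with its parent until the heap property holds. */
void heap_push(int *a, size_t *n, int value) {
    size_t i = (*n)++;
    a[i] = value;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (a[parent] >= a[i]) break;   /* heap constraint is valid again */
        int t = a[parent]; a[parent] = a[i]; a[i] = t;
        i = parent;
    }
}
```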
A binary search tree can be represented as an array of size N using the same representation structure as a heap (children at 2n and 2n+1), but inserting an element this way is a lot harder, because unlike the heap constraint, the binary search tree constraint requires rotations to be performed to retrieve a balanced tree. So, either you manage to keep an N-node tree in an N-cell array at a cost higher than logarithmic, or you waste space by keeping the tree in a larger array (if my memory serves, a red-black tree could waste as much as 50% of your array).
So, a binary search tree in an array is only interesting if the data inside is constant. And if it is, then you don't need the heap structure (children at 2n and 2n+1): you can just sort your array and use binary search.
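For that static case, C's standard bsearch() already does the job:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

int main(void) {
    int a[] = { 2, 3, 5, 7, 11, 13 };   /* already sorted */
    int key = 7;
    int *hit = bsearch(&key, a, sizeof a / sizeof *a, sizeof *a, cmp_int);
    if (hit)
        printf("found at index %td\n", hit - a);
    else
        printf("not found\n");
    return 0;
}
```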
As far as I know, we can use an array to represent binary search trees.
But it is more flexible to use pointers.
The array-based implementation is useful if you need a heap that is used as a priority queue in graph algorithms. In that case the usage pattern is constrained: you pop the topmost element and insert new elements. Removing the top (or min) element requires some re-balancing to become a heap again, which can be done such that the array stays reasonably compact.
A reference for this is the algorithm by Goldberg and Tarjan about efficiently computing optimal network flow in directed graphs, iirc.
A heap is a complete binary tree, unlike a BST. Hence, using arrays is not of much use for BSTs.
I'm looking for a data structure (or structures) that would allow me to keep an ordered list of integers, no duplicates, with indexes and values in the same range.
I need four main operations to be efficient, in rough order of importance:
taking the value from a given index
finding the index of a given value
inserting a value at a given index
deleting a value at a given index
Using an array I have 1 at O(1), but 2 is O(N) and insertion and deletions are expensive (O(N) as well, I believe).
A Linked List has O(1) insertion and deletion (once you have the node), but 1 and 2 are O(N) thus negating the gains.
I tried keeping two arrays a[index]=value and b[value]=index, which turn 1 and 2 into O(1) but turn 3 and 4 into even more costly operations.
Is there a data structure better suited for this?
I would use a red-black tree to map keys to values. This gives you O(log(n)) for 1, 3, 4. It also maintains the keys in sorted order.
For 2, I would use a hash table to map values to keys, which gives you O(1) performance. It also adds O(1) overhead for keeping the hash table updated when adding and deleting keys in the red-black tree.
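A sketch of how the two structures stay in step; the rbtree_* and hash_* calls below are hypothetical placeholders for whatever implementations you pick, not a real library:

```c
/* Hypothetical wrapper types and functions - placeholders only, not a
   real API. The point is that every mutation updates both structures. */
typedef struct rbtree rbtree;        /* key -> value: ops 1, 3, 4 in O(log n) */
typedef struct hashtable hashtable;  /* value -> key: op 2 in O(1) expected  */

typedef struct {
    rbtree *by_key;
    hashtable *by_value;
} index_map;

void im_insert(index_map *m, int key, int value) {
    rbtree_insert(m->by_key, key, value);     /* hypothetical call */
    hash_put(m->by_value, value, key);        /* O(1) bookkeeping, hypothetical */
}

void im_delete(index_map *m, int key) {
    int value = rbtree_get(m->by_key, key);   /* hypothetical call */
    rbtree_delete(m->by_key, key);            /* hypothetical call */
    hash_remove(m->by_value, value);          /* hypothetical call */
}
```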
How about using a sorted array with binary search?
Insertion and deletion are slow, but given the fact that the data are plain integers, they could be optimized with calls to memmove() if you are using C or C++ (memcpy() is not safe here, since the source and destination regions overlap). If you know the maximum size of the array, you can even avoid any memory allocations during the usage of the array, as you can preallocate it to the maximum size.
The "best" approach depends on how many items you need to store and how often you will need to insert/delete compared to finding. If you rarely insert or delete a sorted array with O(1) access to the values is certainly better, but if you insert and delete things frequently a binary tree can be better than the array. For a small enough n the array most likely beats the tree in any case.
If storage size is of concern, the array is better than the trees, too. Trees also need to allocate memory for every item they store and the overhead of the memory allocation can be significant as you only store small values (integers).
You may want to profile which is faster: copying the integers when you insert/delete from the sorted array, or the tree with its memory (de)allocations.
I don't know what language you're using, but if it's Java you can leverage LinkedHashMap or a similar Collection. It's got all of the benefits of a List and a Map, provides constant time for most operations, and has the memory footprint of an elephant. :)
If you're not using Java, the idea of a LinkedHashMap is probably still suitable for a usable data structure for your problem.
Use a vector for the array access.
Use a map as a search index to the subscript into the vector.
given a subscript, fetch the value from the vector: O(1)
given a key, use the map to find the subscript of the value: O(log N)
insert a value: push back on the vector, O(1) amortized, and insert the subscript into the map, O(log N)
delete a value: delete from the map, O(log N)
How about a TreeMap? O(log n) for the operations described.
I like balanced binary trees a lot. They are sometimes slower than hash tables or other structures, but they are much more predictable; they are generally O(log n) for all operations. I would suggest using a Red-black tree or an AVL tree.
How do you achieve 2 with RB-trees? We can make them count their children, updating the counts on every insert/delete operation. This doesn't make those operations significantly slower. Then getting down the tree to find the i-th element is possible in O(log n) time. But I see no implementation of this method in Java nor in the STL.
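A self-contained sketch of the lookup half of that idea in C; maintaining the size field during red-black inserts/deletes (not shown) is O(1) extra work per node touched:

```c
#include <stddef.h>

/* A search-tree node augmented with the size of its subtree. */
typedef struct node {
    int value;
    size_t size;               /* nodes in this subtree, self included */
    struct node *left, *right;
} node;

static size_t size_of(const node *n) { return n ? n->size : 0; }

/* Return the i-th smallest value (0-based); caller ensures i < root->size.
   The symmetric walk computes the rank (index) of a given value. */
int select_ith(const node *root, size_t i) {
    for (;;) {
        size_t left = size_of(root->left);
        if (i < left) {
            root = root->left;
        } else if (i == left) {
            return root->value;
        } else {
            i -= left + 1;
            root = root->right;
        }
    }
}
```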
If you're working in .NET, then according to the MS docs http://msdn.microsoft.com/en-us/library/f7fta44c.aspx
SortedDictionary and SortedList both have O(log n) for retrieval
SortedDictionary has O(log n) for insert and delete operations, whereas SortedList has O(n).
The two differ by memory usage and speed of insertion/removal. SortedList uses less memory than SortedDictionary. If the SortedList is populated all at once from sorted data, it's faster than SortedDictionary. So it depends on the situation as to which is really the best for you.
Also, your argument for the linked list is not really valid: the insert itself might be O(1), but you have to traverse the list to find the insertion point first, so the whole operation is really O(n).