How would you go about implementing a balanced tree with lazy deletions? - avl-tree

More specifically, an AVL tree. Is it possible? I'd like to do it, but I'm thinking the nodes that are marked deleted but not yet removed may be problematic to manage with the rotations.
I have one that works normally, but I'd like to use this one with lazy deletion for something else.

If you want it to remain "balanced" with respect to all nodes (including the ones marked deleted), you don't have to do anything--you're already there.
If you want it to remain "balanced" with respect to the set of undeleted nodes--the question is why? The whole point of balancing is to prevent runaway (linear worst-case) searches, and that depends on the nodes present in the tree, not on their deletion status.

What exactly does not deleting mean in the context of an AVL tree?
It could mean you do no work on deletion, which is to say, you don't update the tree at all.
This will cause the tree to rebalance incorrectly later on, because the upward scan will be working with balance factors that are no longer accurate.
It could mean updating the balance factors but not balancing.
This means that when you eventually did decide to delete something, you could be facing balance factors of magnitude 2 or more (instead of the usual -1..+1), which implies multiple rotations to correct. The problem is that you can no longer tell, after a rotation, whether you have actually removed a level of subtree depth. You may know one side is, say, 3 levels too deep, but you no longer know that exactly one element is responsible for each extra level - something you normally do know, because you add or remove single elements at a time - so you have no idea how many rotations you need to do. You might do three rotations and only lose one level of depth, because two elements sat at that depth. In fact, how would you even know which nodes to rotate to reach the necessary elements? They wouldn't necessarily all lie on the path between your chosen delete element and the point where the balance factor is 3.
I'm not certain, but I'll go out on a limb and say lazy deletes breaks AVL as we know it.
Why would you want to delay, anyway? The whole point of AVL is to spread the rebalance cost across each add/delete so you stay at O(log n) - why build up rebalance debt for larger, less frequent rebalancing?
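For concreteness, here is a minimal sketch of what lazy deletion usually looks like on an AVL node: the node is only marked, the physical shape (and therefore every balance factor) is untouched, so no rotations are involved until nodes are actually removed. The struct and function names are just illustrative.

    /* Sketch only: an AVL node extended with a "deleted" flag. Lazy delete just
     * marks the node; the physical shape (and every balance factor) is untouched,
     * so no rotations happen until nodes are actually removed. */
    #include <stdbool.h>
    #include <stddef.h>

    struct avl_node {
        int key;
        int balance;                       /* usual AVL balance factor: -1, 0, +1 */
        bool deleted;                      /* lazy-deletion mark */
        struct avl_node *left, *right;
    };

    /* Ordinary BST search; a node marked deleted is reported as "not found". */
    static struct avl_node *avl_find(struct avl_node *n, int key)
    {
        while (n) {
            if (key < n->key)       n = n->left;
            else if (key > n->key)  n = n->right;
            else                    return n->deleted ? NULL : n;
        }
        return NULL;
    }

    /* Lazy delete: one O(log n) search, no rotations, no balance-factor updates. */
    static bool avl_lazy_delete(struct avl_node *root, int key)
    {
        struct avl_node *n = avl_find(root, key);
        if (!n)
            return false;
        n->deleted = true;
        return true;
    }

An insert that finds its key on a node marked deleted can simply clear the flag, again with no structural change.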

Related

Data structure that would be optimal for required set of operations

I am looking for an optimal data structure for these pseudo-code operations:
add/update(key, node-content);
node=walk/iterate();
delete(node);
I would use a hash table, but it is not possible to walk/iterate through one in an efficient way (all buckets need to be examined). Currently I am considering an rbtree, but my doubt is that I would need to rebalance it on every add/delete operation, presumably while holding a global tree mutex... Could anyone share some expertise on what the best options may be?
UPDATE: I would like to expand on the intended usage of the data structure, as it should clarify the questions asked so far.
The structure will have a few hundred nodes at most - so not that large. The most frequent operation will be walk(), in which each and every node is read in turn; the order does not matter. walk() can happen a thousand times a second. So far a linked list or even an array would do.
The second most frequent operation is update(node-key, node-content). This is where an efficient search is needed. The operation is likely to occur many hundreds of times a second. Thus, a hash table is appealing.
Sometimes a node will be added - when update() doesn't find an existing node - or deleted. add()/delete() happens rarely, say once a day - so the cost of these operations is irrelevant.
UPDATE 2: I think here is a very good recap of the structures in question. Currently I gravitate towards a skip list.
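Not from the original thread, purely an illustrative sketch of the trade-off being described: with a few hundred nodes, very frequent walk() and update(), and rare add()/delete(), one simple arrangement is a dense array for iteration plus a small open-addressing hash index for lookup by key. All names and sizes (struct registry, reg_find, MAX_NODES, ...) are assumptions.

    /* Illustrative sketch only: a dense array for fast walk() plus a small
     * open-addressing hash index for fast lookup by key. */
    #include <stdint.h>
    #include <string.h>

    #define MAX_NODES  512           /* "a few hundred nodes at most" */
    #define TABLE_SIZE 1024          /* power of two, so load stays under ~50% */

    struct node { uint32_t key; int content; };

    struct registry {
        struct node nodes[MAX_NODES];    /* walk() just scans nodes[0..count-1] */
        int count;
        int slot[TABLE_SIZE];            /* hash slot -> index into nodes[], -1 = empty */
    };

    static uint32_t hash_u32(uint32_t k) { return k * 2654435761u; }  /* Knuth multiplicative hash */

    static void reg_init(struct registry *r)
    {
        r->count = 0;
        memset(r->slot, -1, sizeof r->slot);   /* -1 is all-ones bytes, so memset works */
    }

    static struct node *reg_find(struct registry *r, uint32_t key)
    {
        uint32_t i = hash_u32(key) & (TABLE_SIZE - 1);
        while (r->slot[i] != -1) {                       /* linear probing */
            if (r->nodes[r->slot[i]].key == key)
                return &r->nodes[r->slot[i]];
            i = (i + 1) & (TABLE_SIZE - 1);
        }
        return NULL;
    }

    static struct node *reg_add(struct registry *r, uint32_t key, int content)
    {
        uint32_t i = hash_u32(key) & (TABLE_SIZE - 1);
        while (r->slot[i] != -1)                         /* table never fills: MAX_NODES < TABLE_SIZE */
            i = (i + 1) & (TABLE_SIZE - 1);
        r->slot[i] = r->count;
        r->nodes[r->count].key = key;
        r->nodes[r->count].content = content;
        return &r->nodes[r->count++];
    }

Deletion is rare enough here ("once a day") that a simple tombstone scheme or a full rebuild of the index would do.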

Scanning File to find exact HashTable size vs constantly resizing Array and ReHashing, and other questions

So I am doing a project that requires me to find all the anagrams in a given file. Each file has one word per line.
What I have done so far:
1.) sort the word (using Mergesort - I think this is the best in the worst case... right?)
2.) place it into the hashtable using a hash function
3.) if there is a collision, move to the next available space further along in the array (basically going down one by one until you see an empty spot in the hashtable) (is there a better way to do this? What I am doing is linear probing).
Problem:
When it runs out of space in the hash table... what do I do? I came up with two solutions: either scan the file before inserting anything into the hash table and use one exact size, or keep resizing the array and rehashing as it gets more and more full. I don't know which one to choose. Any tips would be helpful.
A few suggestions:
Sorting is often a good idea, and I can think of a way to use it here, but there's no advantage to sorting items if all you do afterwards is put them into a hashtable. Hashtables are designed for constant-time insertions even when the sequence of inserted items is in no particular order.
Mergesort is one of several sorting algorithms with O(n log n) worst-case complexity, which is optimal if all you can do is compare two elements to see which is smaller. If you can do other operations, like index an array, O(n) sorting can be done with radix sort -- but it's almost certainly not worth your time to investigate this (especially as you may not even need to sort at all).
If you resize a hashtable by a constant factor when it gets full (e.g. doubling or tripling the size) then you maintain constant-time inserts. (If you resize by a constant amount, your inserts will degrade to linear time per insertion, or quadratic time over all n insertions.) This might seem wasteful of memory, but if your resize factor is k, then the proportion of wasted space will never be more than (k-1)/k (e.g. 50% for doubling). So there's never any asymptotic execution-time advantage to computing the exact size beforehand, although this may be some constant factor faster (or slower).
There are a variety of ways to handle hash collisions that trade off execution time versus maximum usable density in different ways.
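To make the resizing point above concrete, here is a rough sketch (in C, with made-up names) of a linear-probing table of words that doubles when it reaches about 70% load, which is what keeps insertion amortized O(1):

    /* Sketch: linear-probing hashtable of words that doubles its size at ~70% load,
     * keeping insertion amortized O(1). Assume t->cap starts as a power of two
     * (e.g. 16) with t->slots allocated and NULL-filled. Error handling omitted. */
    #include <stdlib.h>
    #include <string.h>

    struct table {
        char **slots;        /* NULL = empty slot */
        size_t cap;          /* always a power of two */
        size_t used;
    };

    static size_t hash_str(const char *s)
    {
        size_t h = 5381;                                  /* djb2 */
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    static void table_insert(struct table *t, char *word);

    static void table_grow(struct table *t)
    {
        char **old = t->slots;
        size_t old_cap = t->cap;

        t->cap *= 2;                                      /* constant factor => amortized O(1) */
        t->slots = malloc(t->cap * sizeof *t->slots);
        for (size_t i = 0; i < t->cap; i++)
            t->slots[i] = NULL;
        t->used = 0;

        for (size_t i = 0; i < old_cap; i++)              /* rehash everything into the new array */
            if (old[i])
                table_insert(t, old[i]);
        free(old);
    }

    static void table_insert(struct table *t, char *word)
    {
        if ((t->used + 1) * 10 >= t->cap * 7)             /* keep the load factor under ~70% */
            table_grow(t);

        size_t i = hash_str(word) & (t->cap - 1);
        while (t->slots[i] != NULL) {
            if (strcmp(t->slots[i], word) == 0)
                return;                                   /* duplicate: already stored */
            i = (i + 1) & (t->cap - 1);                   /* linear probing */
        }
        t->slots[i] = word;
        t->used++;
    }

The 70% trigger and the doubling factor are just example choices; any constant growth factor gives the amortized bound described above.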

C: Storing up to a million entries in a hash table

I'm working on a project where efficiency is crucial. A hash table would be very helpful, since I need to easily look up the memory address of a node based on a key. The only problem I foresee is that this hash table will need to handle up to 1 million entries. As I understand it, hash table buckets are usually linked lists so that they can handle multiple entries in the same bucket. It seems to me that with a million entries these lists would be way too slow. What is the common way of implementing something like this? Maybe swapping a standard linked list out for a skip list?
If you want a hash table with a million entries, normally you'd have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of the buckets will have zero or one items. You can, in principle, be very unlucky and get all items in one bucket - but you'd have to be even more unlucky than those people who seem to get struck by lightning every other day.
For hashtables that grow, the normal trick is to grow by a constant percentage - the usual textbook case being growth by doubling the hash-table size. You do this whenever the number of items in the hashtable reaches a certain proportion of the hashtable size, irrespective of how many buckets are actually being used. This gives amortized expected performance of O(1) for inserts, deletes and searches.
The linked list in each bucket of a hash-table is just a way of handling collisions - improbable in a per-operation sense, but over the life of a significant hash table, they do happen - especially as the hash-table gets more than half full.
Linked lists aren't the only way to handle collisions - there's a huge amount of lore about this topic. Walter Bright (developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his Dscript gained a significant performance boost relative to Javascript from this design choice.
He used simple (unbalanced) binary trees when I asked, so the worst-case performance was the same as for linked lists, but the key point I guess is that the binary tree handling code is simple, and the hash table itself makes the odds of building large unbalanced trees very small.
In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option may be to use splay trees for collision handling. But overall, this is a minor issue for a few library designers and a few true obsessives to worry about.
You lose all the advantages of a hash table if the per-bucket lists ever have more than a few entries. The usual way to make a hash table scale to millions of entries is to make the primary hash array resizable, so even with millions of entries, the bucket lists stay short.
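A sketch of that idea, assuming separate chaining and a power-of-two bucket array (the names and the 75% threshold are illustrative, not from the answer):

    /* Sketch: separate chaining with a resizable primary bucket array.
     * Grow (double) when entries reach 75% of the bucket count, so the average
     * chain stays short no matter how many million entries go in. */
    #include <stdlib.h>

    struct entry {
        unsigned long hash;        /* precomputed hash of the key */
        void *key, *value;
        struct entry *next;        /* per-bucket chain */
    };

    struct map {
        struct entry **buckets;    /* nbuckets is kept as a power of two */
        size_t nbuckets;
        size_t count;
    };

    static void map_grow(struct map *m)
    {
        size_t newn = m->nbuckets * 2;
        struct entry **nb = calloc(newn, sizeof *nb);     /* sketch: treats zero bits as NULL */

        for (size_t i = 0; i < m->nbuckets; i++) {        /* relink every existing entry */
            struct entry *e = m->buckets[i];
            while (e) {
                struct entry *next = e->next;
                size_t j = e->hash & (newn - 1);
                e->next = nb[j];
                nb[j] = e;
                e = next;
            }
        }
        free(m->buckets);
        m->buckets = nb;
        m->nbuckets = newn;
    }

    static void map_add(struct map *m, unsigned long hash, void *key, void *value)
    {
        if (m->count * 4 >= m->nbuckets * 3)              /* load factor >= 0.75: grow first */
            map_grow(m);

        size_t i = hash & (m->nbuckets - 1);
        struct entry *e = malloc(sizeof *e);
        e->hash = hash;  e->key = key;  e->value = value;
        e->next = m->buckets[i];                          /* push onto the bucket's chain */
        m->buckets[i] = e;
        m->count++;
        /* duplicate-key handling omitted for brevity */
    }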
You can use a tree instead of a list in the individual "buckets" (AVL or similar).
EDIT: well, a skip list would do too (and seems to be faster) - O(log n) is what you aim for.
The total number of entries does not matter, only the average number of entries per bucket (N / size of the hash table). Use a hash function with a larger range (for example, 20 bits, or even larger) and a correspondingly larger table to keep that average low.
Of course, this will take up more memory, but that's it - the usual memory vs. speed trade-off.
Not sure if this will help you or not, but maybe: http://memcached.org/
If your hash distributes keys uniformly over the buckets (that's a very big IF), then the expected number of insertions into the hashtable needed to exhaust all the buckets is M·ln M (natural log, base e), where M is the number of buckets.
I was surprised I couldn't find this easily online!
I have posted the derivation on my blog and verified it with code using rand(). It does seem to be a pretty good estimate.
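For reference, this is the classical coupon-collector expectation. Assuming each insertion is equally likely to land in any of the M buckets, then while k buckets are still empty the chance of hitting one of them is k/M, so that phase costs M/k insertions on average; summing over k gives (in LaTeX):

    E[T] \;=\; \sum_{k=1}^{M} \frac{M}{k} \;=\; M\,H_M \;\approx\; M\ln M + \gamma M

where H_M is the M-th harmonic number and \gamma \approx 0.577, which matches the M·ln M estimate above.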

Does it make sense to resize a Hash Table down? And when?

My Hash Table implementation has a function to resize the table when the load reaches about 70%. My Hash Table is implemented with separate chaining for collisions.
Does it make sense to resize the hash table down at any point, or should I just leave it as it is? Otherwise, if I increase the size (to roughly double; actually I follow this: Link) when the load is 70%, should I resize it down when the load gets to 30% or below?
Hash tables don't have to have prime-number lengths if you have a good quality hash function (see here). You can make them powers of two, which substantially speeds up index computations.
Why is this relevant to the question? Because when you shrink a power-of-two hashtable, you can leave all entries in the bottom half where they are and simply append the linked list in slot i (from the upper half) onto the linked list in slot i - n/2.
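A sketch of that halving step, assuming separate chaining and an entry struct with a next pointer (names are illustrative):

    /* Sketch of the halving step for a power-of-two, separately chained table.
     * Bucket index is hash & (n - 1); halving just drops the top bit, so the
     * chain in slot i (upper half) folds onto slot i - n/2. */
    #include <stddef.h>

    struct entry { struct entry *next; /* plus key/value fields */ };

    static void table_fold_down(struct entry **buckets, size_t n /* power of two */)
    {
        size_t half = n / 2;

        for (size_t i = half; i < n; i++) {
            struct entry **tail = &buckets[i - half];     /* find end of the lower chain */
            while (*tail)
                tail = &(*tail)->next;
            *tail = buckets[i];                           /* append the upper chain */
            buckets[i] = NULL;
        }
        /* The caller now treats the table as having 'half' buckets and may
           shrink the bucket array to half its size. */
    }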
If memory is cheap, leave it alone. If memory is expensive, resize with hysteresis as you have suggested. When done, profile the result to make sure it performs well and that you haven't done something silly.
Are you writing the hash table for general-purpose use, or is there a specific purpose for it? I suggest not resizing smaller for a general implementation. This will keep your table simple and keep it from thrashing memory under conditions where the table is filled and emptied often. If you end up running into a condition where the hash table needs to be reduced in size, extend the implementation at that point in time.
First idea: The only reason for growing a hashtable is that hashtable performance decreases when there are too many collisions. Growing the table when its load exceeds 70% is a good rule of thumb to prevent this from happening, but it's just a rule of thumb. Much better is to keep track of the number of collisions and only grow the hashtable if they exceed a certain limit or once a certain collision ratio is hit. After all, why would you want to grow a hashtable that is 90% loaded yet has not a single collision? It would have no advantage.
Second idea: The only reason to shrink a hashtable is to save memory, yet shrinking it could increase the number of collisions and thus decrease the lookup performance. This is a classic speed vs. memory trade-off, and why should you solve it yourself? Leave it to whoever is using your code. Just never shrink on your own, but offer a shrink method. If low memory usage is a requirement, whoever is using your code can call shrink regularly. If maximum performance is a requirement, whoever is using your code should never call shrink. Everyone else can use some kind of heuristic to decide if and when to call shrink.
Third idea: When growing or shrinking, always grow/shrink in such a way that a certain load factor is guaranteed after the operation. E.g. when growing, always grow so that afterwards the load factor is 50%, and when shrinking, always shrink so that afterwards the load factor is 70%. Of course, that says nothing about the number of collisions, so adding an element immediately after growing/shrinking may cause the hashtable to grow again, but that is unavoidable, as simulating the effect of a grow/shrink is usually too expensive. Also, shrink will often be called once no further modifications are planned, so it should favor saving memory over avoiding another grow in the future.
Last idea: For every decision you make, you will make the hashtable better for some usage cases and worse for others. If you know how your hashtable is going to be used, this won't be a problem. Yet if you don't, and usually you don't, why make these decisions yourself? Just delegate them. Allow the user of your code to customize all the small details, e.g. how much to grow or shrink, either by allowing all these factors to be set when the hashtable is created or by allowing your hashtable to take delegate functions (callback functions that you can ask whenever you are unsure what to do). That way every user of your code can customize it, even at runtime, for whatever usage scenario they require.
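One way to picture that delegation in C (all names here are hypothetical, not from any particular library):

    /* Sketch: let the user of the hashtable decide when and how to resize,
     * via callbacks. All names here are illustrative. */
    #include <stdbool.h>
    #include <stddef.h>

    struct ht_stats {
        size_t buckets;
        size_t entries;
        size_t collisions;         /* however the table chooses to count them */
    };

    struct ht_policy {
        /* consulted after every insert/delete; return true to trigger a resize */
        bool (*should_grow)(const struct ht_stats *s, void *user);
        bool (*should_shrink)(const struct ht_stats *s, void *user);
        /* consulted when resizing; returns the new bucket count */
        size_t (*new_size)(const struct ht_stats *s, bool growing, void *user);
        void *user;                /* opaque context handed back to the callbacks */
    };

    /* Example policy: grow above 70% load, never shrink automatically. */
    static bool grow_at_70(const struct ht_stats *s, void *user)
    {
        (void)user;
        return s->entries * 10 >= s->buckets * 7;
    }

    static bool never_shrink(const struct ht_stats *s, void *user)
    {
        (void)s; (void)user;
        return false;
    }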

When to switch from unordered lists to sorted lists? [optimization]

I have to implement an algorithm to decompose 3D volumes into voxels. The algorithm starts by identifying which vertices are on each side of the cutting plane and, in a second step, which edges cross the cutting plane.
This process could be optimized by taking advantage of sorted lists. Identifying the split point would be O(log n). But I would have to maintain one such sorted list per axis, for both vertices and edges. Since this is to be implemented for the GPU, I also have some constraints on memory management (i.e. CUDA). Intrusive lists/trees and C are imposed.
With a complete "voxelization" I expect to end up with ~4000 points and 12000 edges. Fortunately this can be optimized by using a smarter strategy: getting rid of processed voxels and ordering the residual volume cuts to keep their number to a minimum. In that case I would expect to have fewer than 100 points and 300 edges. This makes the process more complex to manage, but could end up being more efficient.
The question is thus to help me identify the criteria for determining when the benefit of a sorted data structure is worth the effort and complexity overhead, compared to simple intrusive linked lists.
chmike, this really sounds like the sort of thing you want to do the simpler way first, and see how it behaves. Any sort of GPU voxelization approach is pretty fragile to system details once you get into big volumes at least (which you don't seem to have). In your shoes I'd definitely want the straightforward implementation first, if for no other reason than to check against...
The question will ALWAYS boil down to which operation is most common: accessing or adding.
If you have an unordered list, adding to it takes no time, and accessing particular items takes extra time.
If you have a sorted list, adding to it takes more time, but accessing it is quicker.
Most applications spend most of their time accessing the data rather than adding to it, which means that the (running) time overhead of creating a sorted list will usually be balanced out or covered by the time saved in accessing the list.
If there is a lot of churn in your data (which it doesn't sound like there is), then maintaining a sorted list isn't necessarily advisable, because you will be constantly re-sorting the list at considerable CPU cost.
The complexity of the data structures only matters if they cannot be sorted in a useful way. If they can be sorted, then you'll have to go by the heuristic of
number of accesses:number of changes
to determine if sorting is a good idea.
After considering all the answers, I found that the latter method, used to avoid duplicate computation, would end up being less efficient because of the effort needed to maintain and navigate the data structure. Besides, the initial method is straightforward to parallelize with a few small kernel routines and thus more appropriate for a GPU implementation.
Checking back on my initial method, I also found significant optimization opportunities that leave the volume-cut method well behind.
Since I had to pick one answer, I chose devinb because he answered the question, but Simon's comment, backed up by Tobias Warre's comment, was just as valuable to me.
Thanks to all of you for helping me sort out this issue.
Stack overflow is an impressive service.
