Best practical priority queue data structure for the following workload? [c]

My goal is to find the best priority queue (PQ) data structure given the following requirements:
The workload: initially, we insert k elements into the PQ. Then there are N operations, where N >> k and N is very large. In each operation we take the minimal value of the PQ and update it to a larger value.
The data structure should support multithreading. That is, finding the minimal value and updating it to a larger value should be possible in parallel; given L threads, the algorithm should run in O((N/L) * log N) time.
The data structure should be cache efficient. That is, each thread should have locality of reference and be fast.
The data structure does not need to be space efficient, although that is recommended.
Now, there are several PQs I could use, but all have problems:
A tree-based solution (AVL, red-black, or another balanced tree) or a skip list uses linked nodes and is therefore not cache efficient. The same holds for Fibonacci/pairing heaps.
The basic binary heap solution is not cache efficient.
The d-ary heap solution is not easily made multithreaded.
My solution: I am thinking of using the d-ary heap, with a multithreaded algorithm in which a SWAP operation is locked if its slots are already being accessed by another thread. But I am not sure it will work.
Your thoughts?
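For concreteness, here is a minimal single-threaded sketch in C of the d-ary heap operation this workload hammers (take the minimum and replace it with a larger value). The arity and names are illustrative, and the per-swap locking scheme is deliberately not shown, since that is exactly the part I am unsure about:

    /* 4-ary min-heap sift-down sketch (single-threaded). A per-swap lock
       would have to guard each exchange below. */
    #include <stddef.h>

    #define D 4  /* arity; children of node i are D*i+1 .. D*i+D */

    static void sift_down(int *heap, size_t n, size_t i)
    {
        for (;;) {
            size_t smallest = i;
            size_t first_child = D * i + 1;
            for (size_t c = first_child; c < first_child + D && c < n; ++c)
                if (heap[c] < heap[smallest])
                    smallest = c;
            if (smallest == i)
                break;
            int tmp = heap[i];            /* the swap: the candidate critical section */
            heap[i] = heap[smallest];
            heap[smallest] = tmp;
            i = smallest;
        }
    }

    /* "take the minimum and update it to a larger value" then becomes: */
    static void replace_min(int *heap, size_t n, int larger_value)
    {
        heap[0] = larger_value;           /* assumes larger_value >= current minimum */
        sift_down(heap, n, 0);
    }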

Related

Data structure that would be optimal for required set of operations

I am looking for an optimal data structure for these pseudo-code operations:
add/update(key, node-content);
node=walk/iterate();
delete(node);
I would use a hash table, but it is not possible to walk/iterate through it efficiently (all buckets need to be examined). Currently I am considering an rbtree, but my doubts revolve around the fact that I would need to re-balance it on each add/delete operation, presumably while holding a global tree mutex... Could anyone share some expertise on what the best options may be?
UPDATE: I would like to expand on the usage of the sought data structure, as it should clarify the questions asked so far.
The structure will have a few hundred nodes at most - so not that large. The most frequent operation will be walk(), in which each and every node is read in turn; the order does not matter. walk() can happen a thousand times a second. So far a linked list or even an array would do.
The second most frequent operation is update(node-key, node-content). This is where efficient search is needed. The operation is likely to occur many hundreds of times a second. Thus, a hash table is appealing.
Sometimes a node will be added - when update() doesn't find an existing node - or deleted. add()/delete() happens rarely, say, once a day - so the cost of these operations is irrelevant.
UPDATE 2: I think here is a very good recap of the structures in question. Currently I gravitate towards a skip list.
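Given that profile (walk() dominates, update() needs keyed lookup, add/delete are rare), one hedged sketch in C is a dense array for the walk plus a small open-addressing hash index for the lookup. All names and sizes below are illustrative, and deletion is omitted:

    /* Dense array for walk(), open-addressing hash index for update().
       Capacity checks and deletion are left out of this sketch. */
    #include <stdint.h>
    #include <string.h>

    #define MAX_NODES  512     /* "a few hundred nodes at most" */
    #define TABLE_SIZE 1024    /* power of two, comfortably > MAX_NODES */

    struct node { uint32_t key; int content; };

    static struct node nodes[MAX_NODES];  /* contiguous: walk() is a linear scan */
    static int node_count = 0;
    static int index_of[TABLE_SIZE];      /* hash slot -> index into nodes[], -1 if empty */

    static void init(void) { memset(index_of, -1, sizeof index_of); }

    static uint32_t hash(uint32_t key) { return (key * 2654435761u) & (TABLE_SIZE - 1); }

    /* update an existing node, or add a new one (the rare path) */
    static void update(uint32_t key, int content)
    {
        uint32_t h = hash(key);
        while (index_of[h] != -1) {
            if (nodes[index_of[h]].key == key) {      /* found: update in place */
                nodes[index_of[h]].content = content;
                return;
            }
            h = (h + 1) & (TABLE_SIZE - 1);           /* linear probing */
        }
        nodes[node_count].key = key;                  /* not found: append */
        nodes[node_count].content = content;
        index_of[h] = node_count++;
    }

    /* walk(): visit every node in memory order, order irrelevant */
    static void walk(void (*visit)(struct node *))
    {
        for (int i = 0; i < node_count; ++i)
            visit(&nodes[i]);
    }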

Binary heap data structure - Application

As per my understanding,
A binary heap (the data structure) is used to implement the priority queue ADT. It is a complete binary tree satisfying the heap property.
Heap property - If A is a parent node of B then the key (the value) of node A is ordered with respect to the key of node B with the same ordering applying across the heap.
Firstly, it would help me remember the term "heap" if there were a reason behind calling this data structure a heap, because we also use the term heap memory.
Dictionary meaning of heap - an untidy collection of things piled up haphazardly.
Question,
After learning the Red-Black tree & AVL tree data structures,
why do we need a new data structure (the binary heap)?
Does a binary heap solve a set of problems that a Red-Black or AVL tree does not fit?
The major difference between a binary heap and a red-black tree is the performance on certain operations.
Binary Heap
Pros
It makes an ideal priority queue, since the min/max element (depending on your implementation) can always be accessed in O(1) time, so there is no need to search for it.
It's also very fast for insertion of new values (O(1) on average, O(log n) worst case); see the sketch after this answer.
Cons
Slow searches for random elements.
RB Tree
Pros
Better searching and insertion performance.
Cons
Slower min/max searches.
More overhead in general.
It should be noted that RB trees can make good schedulers too, such as the Completely Fair Scheduler introduced in Linux kernel v2.6.
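As a hedged illustration of those two "pros", here is a minimal array-backed binary min-heap in C (names and the fixed capacity are illustrative): find_min is O(1) because the minimum sits at the root, and insert sifts up through at most log n levels:

    /* Minimal array-backed binary min-heap sketch; no overflow checks. */
    #include <stddef.h>

    #define HEAP_CAP 1024

    static int heap[HEAP_CAP];
    static size_t heap_size = 0;

    static int find_min(void) { return heap[0]; }  /* O(1): the root is the minimum */

    static void insert(int value)                  /* O(log n) sift-up */
    {
        size_t i = heap_size++;
        heap[i] = value;
        while (i > 0) {
            size_t parent = (i - 1) / 2;
            if (heap[parent] <= heap[i])
                break;
            int tmp = heap[parent];
            heap[parent] = heap[i];
            heap[i] = tmp;
            i = parent;
        }
    }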

Copy large array or change access index

Apologies if this question has an obvious answer, but I can't phrase it properly to find an answer online.
In Fortran, suppose I have an array (>100,000 elements) of real numbers. I will continuously access this array (in a consecutive manner) over and over again throughout one step of a time integration scheme. In each subsequent step, some elements of this array will no longer be needed. I do not know how many; it could be anywhere from none of them to all of them. My question is:
Is it better to (1) go through this array every step and copy the remaining elements I need into a new array, even though only a very small percentage might need to be taken out, or (2) keep a new array of integer indexes which I update every timestep and use to access this array? My understanding is that if the memory access is consecutive it should be very quick, and I think this should outweigh the cost of copying the array. On the other hand, updating the integer indexes would be very speedy, but the cost would be that the data would then be fragmented and accessing it would be slower.
Or is this the type of question with no definitive answer and I need to go and test both methodologies to find which is better for my application?
Hard to say beforehand, so the easy answer would indeed be "Measure!"
Some speculation might help with deciding what to measure, though. Everything that follows is under the assumption that the code is indeed performance critical.
Memory Latency:
100k elements usually exceed the L1 and L2 caches, so memory locality will play a role. OTOH, a linear scan is far better than scattered access.
If memory latency is significant compared to the per-element operations, and the majority of elements become "uninteresting" after a given number of iterations, I would aim at:
mark individual elements as "to be skipped in future iterations"
compact the memory (i.e. remove skippable elements) when ~ 50% of the elements become skippable
(Test for above conditions: for a naive implementation, does the time of a single iteration grow faster-than-linear with number of elements?)
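A minimal C sketch of that mark-and-compact idea (a Fortran version would be analogous; the struct, names, and the ~50% threshold are illustrative):

    /* Mark elements as skippable, compact once about half are marked. */
    #include <stddef.h>

    struct work_set {
        double        *values;
        unsigned char *skip;     /* 1 = no longer needed */
        size_t         count;    /* live + skipped elements */
        size_t         skipped;
    };

    static void mark_skippable(struct work_set *w, size_t i)
    {
        if (!w->skip[i]) {
            w->skip[i] = 1;
            w->skipped++;
        }
    }

    static void maybe_compact(struct work_set *w)
    {
        if (2 * w->skipped < w->count)       /* compact only past ~50% waste */
            return;
        size_t dst = 0;
        for (size_t src = 0; src < w->count; ++src)
            if (!w->skip[src]) {             /* keep live elements contiguous */
                w->values[dst] = w->values[src];
                w->skip[dst] = 0;
                dst++;
            }
        w->count = dst;
        w->skipped = 0;
    }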
Cache-friendly blocks:
If memory latency is an issue and it is possible to apply multiple operations to a small chunk (say, 32KiB of data), do so.
Parallelization:
The elephant in the room: it can be added easily if you can process in cache-friendly blocks.

Asymptotically Fast Associative Array with Low Memory Requirements

Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?
It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.
Did you try Patricia (a.k.a. crit-bit or radix) tries? I think they solve the worst-case space issue.
There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.
Dan Gusfield's book is the Bible of string algorithms.
I don't think there is a reason to be worried about the worst case, for two reasons:
You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.
If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.
That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small size and grow it as elements are inserted. Worst-case insertion would be c * m, when every hash table had to be reorganized due to growth, where c is the number of possible characters / unique atomic elements.
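A hedged sketch in C of what such a trie node might look like, with a small open-addressed child table instead of a fixed 256-entry fan-out (names are illustrative; only lookup is shown, and table growth/insertion is omitted):

    /* Trie node with a growable open-addressed child table. */
    #include <stddef.h>

    struct trie_node {
        unsigned char     *keys;      /* edge labels, one per slot */
        struct trie_node **children;  /* child pointers, NULL = empty slot */
        unsigned char      used;      /* occupied slots */
        unsigned char      cap;       /* current table size (power of two) */
        int                is_terminal;
    };

    static struct trie_node *child_lookup(const struct trie_node *n, unsigned char c)
    {
        if (n->cap == 0)
            return NULL;
        for (unsigned i = 0; i < n->cap; ++i) {
            unsigned slot = (c + i) & (n->cap - 1);   /* linear probing */
            if (n->children[slot] == NULL)
                return NULL;                          /* empty slot: edge not present */
            if (n->keys[slot] == c)
                return n->children[slot];             /* found the edge labelled c */
        }
        return NULL;
    }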
In my experience there are three implementations that I think could meet your requirement:
HAT-Trie (combination between trie and hashtable)
JudyArray (compressed n-ary tree)
Double-Array Trie
You can see the benchmark here. They are as fast as a hash table, but with lower memory requirements and better worst-case behaviour.

When to switch from unordered lists to sorted lists? [optimization]

I have to implement an algorithm to decompose 3D volumes into voxels. The algorithm starts by identifying which vertices are on each side of the cutting plane, and in a second step which edges traverse the cutting plane.
This process could be optimized by taking advantage of sorted lists: identifying the split point is O(log n). But I would have to maintain one such sorted list per axis, for both vertices and edges. Since this is to be implemented for use on the GPU, I also have some constraints on memory management (i.e. CUDA). Intrusive lists/trees and C are imposed.
With a complete "voxelization" I expect to end up with ~4000 points and 12000 edges. Fortunately this can be optimized by using a smarter strategy: getting rid of processed voxels and ordering the cutting of residual volumes to keep their number to a minimum. In that case I would expect to have fewer than 100 points and 300 edges. This makes the process more complex to manage but could end up being more efficient.
The question is thus to help me identify the criteria to determine when the benefit of using a sorted data structure is worth the effort and complexity overhead compared to simple intrusive linked lists.
chmike, this really sounds like the sort of thing you want to do the simpler way first, and see how it behaves. Any sort of GPU voxelization approach is pretty fragile to system details once you get into big volumes at least (which you don't seem to have). In your shoes I'd definitely want the straightforward implementation first, if for no other reason than to check against...
The question will ALWAYS boil down to which operation is most common: accessing or adding.
If you have an unordered list, adding to it takes no time, and accessing particular items takes extra time.
If you have a sorted list, adding to it takes more time, but accessing it is quicker.
Most applications spend most of their time accessing the data rather than adding to it, which means that the (running) time overhead of creating a sorted list will usually be balanced or covered by the time saved in accessing the list.
If there is a lot of churn in your data (which it doesn't sound like there is) then maintaining a sorted list isn't necessarily advisable, because you will be constantly re-sorting the list at considerable CPU cost.
The complexity of the data structures only matters if they cannot be sorted in a useful way. If they can be sorted, then you'll have to go by the heuristic of
number of accesses:number of changes
to determine if sorting is a good idea.
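As a hedged illustration of that accesses-to-changes trade-off, here are both flavours side by side in C (names are illustrative and the arrays are assumed to have spare capacity): the unsorted version adds in O(1) but finds in O(n); the sorted version inserts in O(n) but finds in O(log n):

    /* Unsorted: trivial append, linear-scan lookup. */
    #include <stddef.h>

    static size_t append_unsorted(int *a, size_t n, int v) { a[n] = v; return n + 1; }

    static int find_unsorted(const int *a, size_t n, int v)
    {
        for (size_t i = 0; i < n; ++i)
            if (a[i] == v) return (int)i;
        return -1;
    }

    /* Sorted: insertion shifts elements, lookup is a binary search. */
    static size_t insert_sorted(int *a, size_t n, int v)
    {
        size_t i = n;
        while (i > 0 && a[i - 1] > v) { a[i] = a[i - 1]; --i; }
        a[i] = v;
        return n + 1;
    }

    static int find_sorted(const int *a, size_t n, int v)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {                        /* lower-bound binary search */
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < v) lo = mid + 1;
            else hi = mid;
        }
        return (lo < n && a[lo] == v) ? (int)lo : -1;
    }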
After considering all answers, I found out that the latter method, used to avoid duplicate computation, would end up being less efficient because of the effort to maintain and navigate the data structure. Besides, the initial method is straightforward to parallelize with a few small kernel routines and thus more appropriate for a GPU implementation.
Going back over my initial method, I also found significant optimization opportunities that leave the volume-cut method well behind.
Since I had to pick one answer I chose devinb's because he answered the question, but Simon's comment, backed up by Tobias Warre's comment, was just as valuable to me.
Thanks to all of you for helping me sort out this issue.
Stack overflow is an impressive service.

Resources