Do lock-free hash tables that preserve insertion order exist? - c

I'm trying to optimize a library in which I use a lock-based hash table.
One way to do it is to substitute that lock-based structure with a lock-free one.
I found some algorithms on the subject, and I decided to implement one in C based on this paper: Split-ordered lists: lock-free extensible hash tables
The problem is that this kind of structure does not preserve the insertion order of the elements, and I need this feature for two reasons:
1) to get the element that follows the current one (in insertion order, not hash-key order),
2) to replace old entries with new ones when the maximum number of elements in the hash table is reached. This is because I use the hash table like a buffer, and I want to keep its size fixed.
So I ask: do all lock-free hash table implementations suffer from this "lack of insertion order" issue, or is there a solution?

If memory isn't an issue, a simple way to implement this is by using an atomic reference. Modifications will copy the internal data structure, make the changes and then update the reference.
In a simple implementation, that means the last write wins and all other writes are "ignored". For more complex cases, you add a locking structure to the reference that allows write operations to be queued.
So you pay with another level of indirection but get a very simple way to swap data structures and algorithms.
Since this approach works with any algorithm, you can select one which preserves order.
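As a rough C11 sketch of that atomic-reference idea (the names are mine, and an ordered append-only snapshot stands in for the real structure): readers follow a single atomic pointer to an immutable snapshot, while a writer copies the snapshot, modifies the copy, and publishes it with a compare-and-swap.

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    /* Readers only ever dereference 'current'; writers never modify a
       published snapshot, they replace it. Insertion order is preserved
       because each snapshot is an append-only copy of the previous one. */
    typedef struct {
        size_t count;
        long items[];                     /* kept in insertion order */
    } snapshot_t;

    static _Atomic(snapshot_t *) current; /* assumed initialized to an empty snapshot */

    void append(long value)
    {
        for (;;) {
            snapshot_t *old = atomic_load(&current);
            snapshot_t *new = malloc(sizeof *new + (old->count + 1) * sizeof(long));
            memcpy(new->items, old->items, old->count * sizeof(long));
            new->items[old->count] = value;
            new->count = old->count + 1;
            if (atomic_compare_exchange_weak(&current, &old, new))
                return;       /* published; freeing 'old' safely needs hazard
                                 pointers, epochs or RCU, which is omitted here */
            free(new);        /* lost the race: retry against the newer snapshot */
        }
    }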

Related

Why is insertion into an immutable trie only slower by a constant factor as opposed to using a mutable data structure?

Lee Byron makes this point in the video, but I can't seem to find the part where he explains it.
https://www.youtube.com/watch?v=I7IdS-PbEgI&t=1604s
Is this because when you update a node you have to traverse log(n) nodes to reach it, whereas an immutable structure must copy worst-case n nodes? That is as far as I get in my thinking.
If you attempted to create an immutable list the simple way, the obvious solution would be to copy the whole list into a new list and exchange that single item. So a larger list would take longer to copy, right? The result would be at least O(n).
Immutable.js, on the other hand, uses a trie (see Wikipedia), which allows it to reuse most of the structure while making sure that existing references are not mutated.
Simply put, you create a new tree and create new branches only for the modified parts. Where a branch is unchanged, the new tree can just link to the original structure instead of copying it.
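To make the branch-sharing idea concrete, here is a minimal sketch in C of a persistent binary search tree rather than Immutable.js's wide tries (names are my own): an insert allocates new nodes only along the path from the root, about O(log n) of them in a balanced tree, and every untouched subtree is shared with the old version.

    #include <stdlib.h>

    typedef struct node {
        int key;
        const struct node *left, *right;
    } node_t;

    static const node_t *mk(int key, const node_t *l, const node_t *r)
    {
        node_t *n = malloc(sizeof *n);
        n->key = key; n->left = l; n->right = r;
        return n;
    }

    /* Returns a new root; the old root stays valid and unmodified. */
    const node_t *insert(const node_t *t, int key)
    {
        if (!t) return mk(key, NULL, NULL);
        if (key < t->key)
            return mk(t->key, insert(t->left, key), t->right);  /* copy path, share right subtree */
        if (key > t->key)
            return mk(t->key, t->left, insert(t->right, key));  /* copy path, share left subtree */
        return t;                                               /* key already present: share everything */
    }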
The immutable.js documentation starts with two links to longer descriptions; the one about vector tries in particular is nice:
These data structures are highly efficient on modern JavaScript VMs by
using structural sharing via hash maps tries and vector
tries as popularized by Clojure and Scala, minimizing the need to
copy or cache data.
If you want to know more of the details, you might also want to take a look at the question How Immutability is Implemented.

Why should we use a stack, since an array or linked list can do everything a stack can do?

I was wondering why we should even use a stack, since an array or linked list can do everything a stack can do. Why do we bother naming it a "data structure" separately? In the real world, just using an array would be sufficient to solve the problem; why would one bother to implement a stack, which restricts you to pushing and popping only at the top of the collection?
I think it is better to use the term data type to refer to things whose behavior is defined by some interface, or algebra, or collection of operations. Things like
stacks
queues
priority queues
deques (double-ended queues)
lists
sets
hierarchies
maps (dictionaries)
are types because for each, we just list their behaviors. Stacks are stacks because they can only be pushed and popped. Queues are queues because ... (you get the picture).
On the other hand, a data structure is an implementation of a data type defined by the way it arranges its components. Examples include
array (constant time access by index)
linked list
bitmap
BSTs
Red-black tree
(a,b)-trees (e.g. 2-3 trees)
Skip lists
hash tables (many variants)
adjacency matrices
Bloom filters
Judy arrays
Tries
A lot of people confuse the terms data structure and data type. Strictly speaking, stacks are data types, not data structures, but there is no need to be too pedantic about it.
To answer your specific question, we use stacks as data types in situations where we want to ensure our data is modified only by pushing and popping, and we never violate this access pattern (even by accident).
Under the hood, we may use a linked list as the implementation of our stack. But most programming languages provide a way to restrict the interface so that our code uses the data in a LIFO fashion readably and securely.
TL;DR: Readability. Security.
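A minimal C sketch of that restriction (my own names, nothing from a particular library): the public type is opaque, so callers can only ever push and pop, while the linked list stays an internal detail.

    #include <stdlib.h>

    /* In a real project the struct definitions live in the .c file and the
       header exposes only the typedef and the functions below. */
    typedef struct stack stack;

    struct node  { void *item; struct node *next; };
    struct stack { struct node *top; };

    stack *stack_new(void) { return calloc(1, sizeof(stack)); }

    int stack_push(stack *s, void *item)          /* LIFO insert */
    {
        struct node *n = malloc(sizeof *n);
        if (!n) return -1;
        n->item = item;
        n->next = s->top;
        s->top  = n;
        return 0;
    }

    void *stack_pop(stack *s)                     /* LIFO remove; NULL when empty */
    {
        struct node *n = s->top;
        if (!n) return NULL;
        void *item = n->item;
        s->top = n->next;
        free(n);
        return item;
    }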
Stacks can be, and usually are, implemented using an array or a linked list as the underlying structure. There are many benefits to using a stack.
One of the main benefits is the LIFO ordering that is maintained by the push/pop operations. This gives the stack elements a natural order and lets them be removed in the reverse order of their addition. Such a data structure can be very useful in many applications where using just an array or a linked list would actually be less efficient and/or more complicated. Some of those applications are:
Balancing of symbols (such as parentheses)
Infix-to-postfix conversion
Evaluation of postfix expression
Implementing function calls (including recursion)
Finding of spans (finding spans in stock markets)
Page-visited history in a Web browser (Back button)
Undo sequence in a text editor
Matching Tags in HTML and XML
Here are some more stack applications
And some more...
The two underlying implementations, an array or a linked list, give the stack different capabilities and features. Here is a simple comparison:
(Dynamic) Array Implementation:
1. All operations take constant time besides push().
2. Expensive doubling operation every once in a while (a sketch follows the comparison below).
3. Any sequence of n operations (starting from an empty stack) takes time proportional to n, i.e. an "amortized" constant-time bound per operation.
Linked List Implementation:
1. Grows and shrinks easily.
2. Every operation takes constant time O(1).
3. Every operation uses extra space and time to deal with references.
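A rough sketch of the dynamic-array variant in C (my own names), showing the usually cheap push and the occasional doubling that produces the amortized bound:

    #include <stdlib.h>

    typedef struct {
        long  *data;
        size_t size, capacity;
    } array_stack;

    int stack_push(array_stack *s, long value)
    {
        if (s->size == s->capacity) {             /* the occasional expensive step */
            size_t cap = s->capacity ? s->capacity * 2 : 8;
            long *p = realloc(s->data, cap * sizeof *p);
            if (!p) return -1;
            s->data = p;
            s->capacity = cap;
        }
        s->data[s->size++] = value;               /* the common O(1) step */
        return 0;
    }

    int stack_pop(array_stack *s, long *out)
    {
        if (s->size == 0) return -1;              /* underflow */
        *out = s->data[--s->size];
        return 0;
    }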

Which data structure should I use if the data is mostly sorted?

I have a huge amount of data (mostly of type long long) which is mostly sorted (the data is spread across different files, and within each file the data is in sorted order). I need to dump this data into a single file in sorted order. Which data structure should I use? I am thinking about a BST.
Is there any other data structure I should use which can give me optimum performance?
Thanks
Arpit
Using any additional data structure won't help. Since most of your data is already sorted and you just need to fix the occasional value, use a simple array to extract data, then use Insertion Sort.
Insertion sort runs in O(n) for mostly presorted data.
However, whether this works depends on whether you can hold a large enough array in memory, given your input size.
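For reference, a plain insertion sort over long long values might look like this (a sketch, not from the original answer); when the data is already mostly in order, the inner loop rarely shifts anything, so the total work stays close to O(n):

    #include <stddef.h>

    void insertion_sort(long long *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            long long key = a[i];
            size_t j = i;
            while (j > 0 && a[j - 1] > key) {     /* shift larger elements right */
                a[j] = a[j - 1];
                j--;
            }
            a[j] = key;                           /* drop the element into place */
        }
    }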
Update:
I wasn't very clear on your definition of "mostly sorted". Generally it means only a few elements are not in their precise sorted positions.
However, as you stated further, the data is in different files where each file is individually sorted, so it may be a good candidate for the merge subroutine of merge sort.
Note that the merge routine merges two already-sorted arrays. If you have, say, 10 files, each of which is individually sorted, then using the merge routine would only take O(n).
However, if there are even a few instances where a single file is not perfectly sorted on its own, you need to use insertion sort.
Update 2:
The OP says he cannot use an array because he cannot know the number of records in advance. Using a simple linked list is out of the question, since it never competes with arrays (sequential vs. random access time) in time complexity.
As pointed out in the comments, using a linked list is a good idea IF the files are individually sorted and all you need to run on them is the merge procedure.
Dynamically allocated arrays are best if he can predict the size at some point. Since the c++ tag was used (only removed later), going for a vector would be a good idea, since it can resize comfortably.
Otherwise, one option might be heap sort, since it calls heapify first, i.e. builds a heap (so it can dynamically accommodate as many elements as needed), and still gives O(n log n) complexity. This is still better than trying to use a linked list.
Perhaps you don't need a data structure at all.
If the files are already sorted, you can use the merge part of merge sort, which is O(n), or more generally O(n*log k), where k is the number of files.
How many files do you have to merge?
If it's only a few (on the order of a dozen or so) and each individual file is fully sorted, then you shouldn't need to build any sort of complex data structure at all: just open all the input files, read the next record from each file, compare, write the smallest to the destination, then replace that record from the appropriate file.
If each file is not fully sorted or if there are too many files to open at once, then yes, you will need to build an intermediate data structure in memory. I'd recommend a self-balancing tree, but since the data are already mostly sorted, you'll be re-balancing on almost every insert. A heap may work better for your purposes.
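To illustrate that merge loop, here is a sketch in C under the assumption that the inputs are binary files of sorted long long values (the file handling and the 64-file limit are my simplifications); with only a dozen or so inputs a linear scan for the minimum is fine, while many inputs would call for a min-heap keyed on each file's current record.

    #include <stdio.h>
    #include <stdbool.h>

    /* Merge up to 64 individually sorted binary files of long long into 'out'. */
    void merge_files(FILE **in, int k, FILE *out)
    {
        long long cur[64];
        bool live[64];

        for (int i = 0; i < k; i++)               /* prime one record per file */
            live[i] = fread(&cur[i], sizeof cur[i], 1, in[i]) == 1;

        for (;;) {
            int min = -1;
            for (int i = 0; i < k; i++)           /* find the smallest current record */
                if (live[i] && (min < 0 || cur[i] < cur[min]))
                    min = i;
            if (min < 0) break;                   /* every file is exhausted */
            fwrite(&cur[min], sizeof cur[min], 1, out);
            live[min] = fread(&cur[min], sizeof cur[min], 1, in[min]) == 1;
        }
    }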
Best Sorting Algorithm:
Insertion sort can be used efficiently for nearly sorted data (O(n) time complexity).
Best data structure:
Linked list is the best choice for the data structure if you are sorting it using insertion sort.
Reason for using linked list:
Removing and inserting elements can be done faster when elements are stored as a linked list.

How are hash tables implemented internally in popular languages?

Can someone please shed some light on how popular languages like Python and Ruby implement hash tables internally for symbol lookup? Do they use the classic "array with linked lists" method, or a balanced tree?
I need a simple (few LOC) and fast method for indexing the symbols in a DSL written in C. I was wondering what others have found most efficient and practical.
The classic "array of hash buckets" you mention is used in every implementation I've seen.
One of the most educational versions is the hash implementation in the Tcl language, in the file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, different hash table types, strategies, etc. Side note: the code implementing the Tcl language is really readable.
Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There's also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under "HV". If you're truly adventurous you can dig into hv.c.
The hashing algorithm used to be pretty simple, but it's probably a lot more complicated now with Unicode. Because the algorithm was predictable, there was a DoS attack whereby the attacker generated data which would cause hash collisions. For example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash, which then shoved it all into one bucket, making lookups in the resulting hash O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result, Perl now perturbs the hash function with a bit of random data.
You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.
As for "most efficient and practical", use someone else's hash library. For god's sake don't write one yourself for production use. There's a hojillion robust and efficient ones out there already.
Lua tables use an utterly ingenious implementation which for arbitrary keys behaves like an 'array of buckets', but if you use consecutive integers as keys, it has the same representation and space overhead as an array. In the implementation, each table has a hash part and an array part.
I think this is way cool :-)
Attractive Chaos has a comparison of Hash Table Libraries and an update.
The source code is available, and it is in C and C++.
Balanced trees sort of defeat the purpose of hash tables since a hash table can provide lookup in (amortized) constant time, whereas the average lookup on a balanced tree is O(log(n)).
Separate chaining (array with linked list) really works quite well if you have enough buckets, and your linked list implementation uses a pooling allocator rather than malloc()ing each node from the heap individually. I've found it to be just about as performant as any other technique when properly tuned, and it is very easy and quick to write. Try starting with 1/8 as many buckets as source data.
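For a small DSL symbol table in C, a bare-bones version of that separate-chaining scheme might look like the sketch below (the names and the fixed bucket count are my choices, and, as noted above, a tested library is the better option for production use):

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 1024                    /* power of two; tune to your data */

    struct entry { char *key; void *value; struct entry *next; };
    static struct entry *buckets[NBUCKETS];

    static unsigned long hash(const char *s) /* djb2 string hash */
    {
        unsigned long h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h;
    }

    void *lookup(const char *key)
    {
        for (struct entry *e = buckets[hash(key) % NBUCKETS]; e; e = e->next)
            if (strcmp(e->key, key) == 0)
                return e->value;
        return NULL;
    }

    void insert(const char *key, void *value)
    {
        unsigned long b = hash(key) % NBUCKETS;
        struct entry *e = malloc(sizeof *e);
        e->key = strdup(key);                /* strdup is POSIX; copy the key */
        e->value = value;
        e->next = buckets[b];                /* chain at the head of the bucket */
        buckets[b] = e;
    }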
You can also use open addressing with quadratic or polynomial probing, as Python does.
If you can read Java, you might want to check out the source code for its various map implementations, in particular HashMap, TreeMap and ConcurrentSkipListMap. The latter two keep the keys ordered.
Java's HashMap uses the standard technique you mention of chaining at each bucket position. It uses fairly weak 32-bit hash codes and stores the keys in the table. The Numerical Recipes authors also give an example (in C) of a hash table essentially structured like Java's but in which (a) you allocate the nodes of the bucket lists from an array, and (b) you use a stronger 64-bit hash code and dispense with storing keys in the table.
What Crashworks meant to say was...
The purpose of hash tables is constant-time lookup, insertion, and deletion. In terms of the algorithm, all operations are O(1) amortized.
Whereas if you use a tree, the worst-case operation time will be O(log n) for a balanced tree, where n is the number of nodes. But do we really see hashes implemented as trees?

Linked lists or hash tables?

I have a linked list of around 5000 entries (NOT inserted simultaneously), and I traverse the list looking for a particular entry on occasion (though this is not very often). Should I consider a hash table as a better choice for this case, replacing the linked list (which is doubly linked and linear)? Using C in Linux.
If you have not found the code to be the slow part of the application via a profiler then you shouldn't do anything about it yet.
If it is slow, but the code is tested, works, and is clear, and there are other slower areas that you can work on speeding up, do those first.
If it is buggy then you need to fix it anyway; go for the hash table, as it will be faster than the list. This assumes that the order in which the data is traversed does not matter; if you care about the insertion order then stick with the list (you can do things with a hash table and keep the order, but that will make the code much trickier).
Given that you need to search the list only on occasion, the odds of this being a significant bottleneck in your code are small.
Another data structure to look at is a "skip list" which basically lets you skip over a large portion of the list. This requires that the list be sorted however, which, depending on what you are doing, may make the code slower overall.
Whether using hash table is more optimum or not depends on the use case, which you have not described in detail. But more importantly, make sure the bottleneck of performance is in this part of the code. If this code is called only once in a while and not in a critical path, no use bothering to change the code.
Have you measured and found a performance hit with the lookup? A hash_map or hash table should be good.
If you need to traverse the list in order (not as a part of searching for elements, but say for displaying them) then a linked list is a good choice. If you're only storing them so that you can look up elements then a hash table will greatly outperform a linked list (for all but the worst possible hash function).
If your application calls for both types of operations, you might consider keeping both, and using whichever one is appropriate for a particular task. The memory overhead would be small, since you'd only need to keep one copy of each element in memory and have the data structures store pointers to these objects.
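One common way to do that in C is to give each entry links into both structures (a sketch; the field names are mine), so the element is stored once yet is reachable both in insertion order and via its hash bucket:

    struct entry {
        long          key;
        void         *payload;
        struct entry *prev, *next;    /* doubly linked list in insertion order */
        struct entry *hash_next;      /* next entry in the same hash bucket */
    };

Deleting an entry then means unlinking it from both the ordered list and its bucket chain, which is still O(1) once the entry has been found.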
As with any optimization step that you take, make sure you measure your code to find the real bottleneck before you make any changes.
If you care about performance, you definitely should. If you're iterating through the thing to find a certain element with any regularity, it's going to be worth it to use a hash table. If it's a rare case, though, and the ordinary use of the list is not a search, then there's no reason to worry about it.
If you only traverse the collection, I don't see any advantage to using a hash map.
I advise against hashes in almost all cases.
There are two reasons. Firstly, the size of the hash table is fixed.
Second, and much more importantly: the hashing algorithm. How do you know you've got it right? How will it behave with real data rather than test data?
I suggest a balanced B-tree. Always O(log n), no uncertainty with regard to a hash algorithm, and no size limits.

Resources