Asymptotically Fast Associative Array with Low Memory Requirements - c

Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?
It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.

Did you try Patricia tries (also known as crit-bit or radix tries)? I think they solve the worst-case space issue.
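To give a feel for the memory footprint, here is a minimal sketch of a crit-bit style internal node (my own illustration, with assumed field names): keys are stored externally, and each internal node only records which byte and which bit first distinguish its two subtrees, so n keys need at most n - 1 internal nodes rather than 256 pointers per node.

#include <stddef.h>

/* Sketch of a crit-bit internal node; the two children are either
 * further internal nodes or (tagged) pointers to the stored keys. */
struct critbit_node {
    void *child[2];          /* left/right subtree or external key */
    size_t byte;             /* index of the first differing byte  */
    unsigned char otherbits; /* mask identifying the differing bit */
};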

There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.
Dan Gusfield's book is the Bible of string algorithms.

I don't think there is a reason to be worried about the worst case, for two reasons:
You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.
If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.
That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small size and increase it as elements are inserted. Worst-case insertion would be c * m when every hash table had to be reorganized due to size increases, where c is the number of possible characters / unique atomic elements.
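A rough sketch of that design (hypothetical names, single-byte edge labels, power-of-two table sizes, allocation error handling omitted): each node starts with a two-slot child table and doubles it whenever it becomes half full.

#include <stdlib.h>

/* Trie node whose children live in a small open-addressed hash table
 * keyed by the edge byte, so per-node space tracks the actual fan-out
 * instead of a fixed 256-entry pointer array. */
struct tnode {
    unsigned char *labels;   /* edge byte of each occupied slot      */
    struct tnode **childs;   /* child pointer, NULL means slot empty */
    size_t cap;              /* table size, always a power of two    */
    size_t count;            /* occupied slots                       */
    int terminal;            /* set if a stored key ends here        */
};

static struct tnode *node_new(void) {
    struct tnode *n = calloc(1, sizeof *n);
    n->cap = 2;
    n->labels = calloc(n->cap, 1);
    n->childs = calloc(n->cap, sizeof *n->childs);
    return n;
}

/* Find the slot for edge byte c (hash = low bits, linear probing). */
static size_t node_slot(const struct tnode *n, unsigned char c) {
    size_t i = c & (n->cap - 1);
    while (n->childs[i] != NULL && n->labels[i] != c)
        i = (i + 1) & (n->cap - 1);
    return i;
}

static void node_grow(struct tnode *n) {
    unsigned char *old_l = n->labels;
    struct tnode **old_c = n->childs;
    size_t old_cap = n->cap;

    n->cap *= 2;
    n->labels = calloc(n->cap, 1);
    n->childs = calloc(n->cap, sizeof *n->childs);
    for (size_t i = 0; i < old_cap; i++)
        if (old_c[i] != NULL) {             /* rehash into the new table */
            size_t j = node_slot(n, old_l[i]);
            n->labels[j] = old_l[i];
            n->childs[j] = old_c[i];
        }
    free(old_l);
    free(old_c);
}

/* Return the child reached by byte c, creating it if necessary. */
static struct tnode *node_child(struct tnode *n, unsigned char c) {
    size_t i = node_slot(n, c);
    if (n->childs[i] == NULL) {
        if (2 * (n->count + 1) > n->cap) {  /* keep load factor <= 1/2 */
            node_grow(n);
            i = node_slot(n, c);
        }
        n->labels[i] = c;
        n->childs[i] = node_new();
        n->count++;
    }
    return n->childs[i];
}

/* Insert a NUL-terminated key by walking/creating one node per byte. */
static void trie_insert(struct tnode *root, const char *key) {
    for (; *key; key++)
        root = node_child(root, (unsigned char)*key);
    root->terminal = 1;
}

Lookup is the same walk without the creation step, so operations stay O(m) in the key length while a sparsely branching node only pays for the children it actually has.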

In my experience, there are three implementations that I think could meet your requirements:
HAT-Trie (combination between trie and hashtable)
JudyArray (compressed n-ary tree)
Double-Array Trie
You can see the benchmark here. They are as fast as a hash table, but with lower memory requirements and better worst-case behaviour.

Related

Why is Merge sort better for large arrays and Quick sort for small ones?

The only reason I see for using merge sort over quick sort is if the list was already (or mostly) sorted.
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
It would seem unintuitive to say that because of large or small data sets one is better than the other.
For example, quoting the GeeksforGeeks article on that:
Merge sort can work well on any type of data sets irrespective of its size (either large or small).
whereas
The quick sort cannot work well with large datasets.
And next it says:
Merge sort is not in place because it requires additional memory space to store the auxiliary arrays.
whereas
The quick sort is in place as it doesn’t require any additional storage.
I understand that space complexity and time complexity are separate things. But it still is an extra step, and of course writing everything to a new array would take more time with a large data set.
As for the pivoting problem, the bigger the data set, the lower the chance of picking the lowest or highest item (unless, again, it's an almost sorted list).
So why is it considered that merge sort is a better way of sorting large data sets instead of quick sort?
Why is Merge sort better for large arrays and Quick sort for small ones?
It would seem unintuitive to say that because of large or small data sets one is better than the other.
Assuming the dataset fits in memory (not paged out), the issue is not the size of the dataset, but a worst-case pattern for a particular implementation of quick sort that results in O(n^2) time complexity. Quick sort can use median of medians to guarantee a worst-case time complexity of O(n log n), but that ends up making it significantly slower than merge sort. An alternative is to switch to heap sort if the level of recursion becomes too deep; this is known as introsort and is used in some libraries.
https://en.wikipedia.org/wiki/Median_of_medians
https://en.wikipedia.org/wiki/Introsort
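A minimal sketch of that introsort idea (illustrative names, a plain middle-element pivot, not the exact code any library ships): quicksort proceeds normally, but once a depth budget of roughly 2*log2(n) is used up, the current sub-array is finished off with heapsort, so the worst case stays O(n log n).

#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Fallback: simple sift-down heapsort, O(n log n) worst case. */
static void heapsort_ints(int *a, size_t n) {
    for (size_t start = n / 2; start-- > 0; ) {        /* build max-heap */
        size_t root = start;
        for (;;) {
            size_t child = 2 * root + 1;
            if (child >= n) break;
            if (child + 1 < n && a[child] < a[child + 1]) child++;
            if (a[root] >= a[child]) break;
            swap_int(&a[root], &a[child]);
            root = child;
        }
    }
    for (size_t end = n; end-- > 1; ) {                /* extract maxima */
        swap_int(&a[0], &a[end]);
        size_t root = 0;
        for (;;) {
            size_t child = 2 * root + 1;
            if (child >= end) break;
            if (child + 1 < end && a[child] < a[child + 1]) child++;
            if (a[root] >= a[child]) break;
            swap_int(&a[root], &a[child]);
            root = child;
        }
    }
}

/* Hoare-style partition around the middle element; returns p with
 * lo <= p < hi, everything in a[lo..p] <= pivot <= everything after. */
static size_t partition_ints(int *a, size_t lo, size_t hi) {
    int pivot = a[lo + (hi - lo) / 2];
    size_t i = lo, j = hi;
    for (;;) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i >= j) return j;
        swap_int(&a[i], &a[j]);
        i++; j--;
    }
}

/* Quicksort with a depth budget; switch to heapsort when it runs out. */
static void introsort_rec(int *a, size_t lo, size_t hi, int depth) {
    while (lo < hi) {
        if (depth-- == 0) { heapsort_ints(a + lo, hi - lo + 1); return; }
        size_t p = partition_ints(a, lo, hi);
        introsort_rec(a, lo, p, depth);   /* recurse on the left part  */
        lo = p + 1;                       /* iterate on the right part */
    }
}

static void introsort(int *a, size_t n) {
    int depth = 0;
    for (size_t m = n; m > 1; m >>= 1) depth++;        /* ~ log2(n) */
    if (n > 1) introsort_rec(a, 0, n - 1, 2 * depth);
}

Real library versions also hand very small sub-arrays to insertion sort, but the depth-limit trick is what caps the worst case.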
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
There are variations of merge sort that don't require any extra storage for data, but they tend to be about 50+% slower than standard merge sort.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
Every element of a sub-array will be compared to the pivot element. As the number of equal elements increases, Lomuto partition scheme gets worse, while Hoare partition scheme gets better. With a lot of equal elements, Hoare partition scheme needlessly swaps equal elements, but the check to avoid the swaps generally costs more time than just swapping.
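To make that concrete, here is a hedged sketch of the Lomuto scheme for contrast with the Hoare scheme described above. Every element equal to the pivot passes the a[j] <= pivot test, so with many duplicate keys almost everything ends up on one side and the splits degenerate toward O(n^2):

#include <stddef.h>

/* Lomuto partition: pivot is the last element; returns its final index. */
static size_t lomuto_partition(int *a, size_t lo, size_t hi) {
    int pivot = a[hi];
    size_t i = lo;                           /* boundary of the "<= pivot" zone */
    for (size_t j = lo; j < hi; j++)
        if (a[j] <= pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;   /* put the pivot in place */
    return i;
}

With an all-equal input, every iteration takes the swap branch and the partition point lands at the far end, which is exactly the unbalanced behaviour described above.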
sorting an array of pointers to objects
Merge sort does more moves but fewer compares than quick sort. If sorting an array of pointers to objects, only the pointers are being moved, but comparing objects requires dereferencing the pointers and doing whatever work is needed to compare the objects. In this case, and in other cases where a compare takes more time than a move, merge sort is faster.
large datasets that don't fit in memory
For datasets too large to fit in memory, a memory-based sort is used to sort "chunks" of the dataset that fit into memory, which are then written to external storage. The "chunks" on external storage are then merged using a k-way merge to produce a sorted dataset.
https://en.wikipedia.org/wiki/External_sorting
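A sketch of the merge phase only (illustrative names; a real external sort streams each run from disk through a buffer and typically uses a min-heap once k gets large):

#include <stddef.h>

/* One sorted run; in a real external sort this would wrap a file. */
struct run { const int *data; size_t len, pos; };

/* Merge k sorted runs into out[] by repeatedly taking the smallest
 * head element; returns the number of elements written. */
static size_t kway_merge(struct run *runs, size_t k, int *out) {
    size_t written = 0;
    for (;;) {
        size_t best = k;                       /* k means "none found" */
        for (size_t i = 0; i < k; i++)
            if (runs[i].pos < runs[i].len &&
                (best == k ||
                 runs[i].data[runs[i].pos] < runs[best].data[runs[best].pos]))
                best = i;
        if (best == k) return written;         /* every run is exhausted */
        out[written++] = runs[best].data[runs[best].pos++];
    }
}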
I was trying to figure out which sorting algorithm (merge/quick) has better time and memory efficiency as the input size increases, so I wrote a program that generates a list of random numbers and sorts the list with both algorithms. The program generates 5 txt files that record random numbers of length 1M, 2M, 3M, 4M and 5M (M stands for millions), and then I got the following results.
(Tables and charts followed here in the original post: execution time in seconds and memory usage in KB for each input size.)
If you want the code, here is the GitHub repo: https://github.com/Nikodimos/Merge-and-Quick-sort-algorithm-using-php
In my scenario, merge sort became more efficient as the file size increased.
In addition to rcgldr's detailed response I would like to underscore some extra considerations:
Large and small are quite relative: in many libraries, small arrays (with fewer than 30 to 60 elements) are usually sorted with insertion sort (see the sketch after this list). This algorithm is simpler and optimal if the array is already sorted, albeit with quadratic complexity in the worst case.
in addition to space and time complexities, stability is a feature that may be desirable, even necessary in some cases. Both Merge Sort and Insertion Sort are stable (elements that compare equal remain in the same relative order), whereas it is very difficult to achieve stability with Quick Sort.
As you mentioned, Quick Sort has a worst case time complexity of O(N^2) and libraries do not implement median of medians to curb this downside. Many just use median of 3 or median of 9 and some recurse naively on both branches, paving the way for stack overflow in the worst case. This is a major problem as datasets can be crafted to exhibit worst case behavior, leading to denial of service attacks, slowing or even crashing servers. This problem was identified by Doug McIlroy in his famous 1999 paper A Killer Adversary for Quicksort. Implementations are available and attacks have been perpetrated using this technique (cf this discussion).
Almost sorted arrays are quite common in practice, and neither Quick sort nor Merge sort treats them really efficiently. Libraries now use more advanced combinations of techniques such as Timsort to achieve better performance and stability.
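As referenced in the first point above, a minimal sketch of the small-array cutoff (the threshold value is illustrative):

#include <stddef.h>

enum { SMALL_CUTOFF = 32 };   /* typical cutoffs are a few dozen elements */

/* Insertion sort: quadratic in the worst case, but linear on
 * already-sorted input and very low overhead for tiny arrays. */
static void insertion_sort(int *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int x = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > x) {   /* shift larger elements right */
            a[j] = a[j - 1];
            j--;
        }
        a[j] = x;
    }
}

A hybrid driver would call insertion_sort(a, n) whenever n < SMALL_CUTOFF and fall back to merge sort or quicksort otherwise.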

Scanning a file to find the exact HashTable size vs constantly resizing the array and rehashing, and other questions

So I am doing a project that requires me to find all the anagrams in a given file. The file has one word on each line.
What I have done so far:
1.) sort the word (using Mergesort - (I think this is the best in the worst case.. right?))
2.) place into the hashtable using a hash function
3.) if there is a collision, move to the next available space further in the array (basically going down one by one until you see an empty spot in the hashtable). (Is there a better way to do this? What I am doing is linear probing.)
Problem:
When it runs out of space in the hash table, what do I do? I came up with two solutions: either scan the file before inputting anything into the hash table and have one exact size, or keep resizing the array and rehashing as it gets more and more full. I don't know which one to choose. Any tips would be helpful.
A few suggestions:
Sorting is often a good idea, and I can think of a way to use it here, but there's no advantage to sorting items if all you do afterwards is put them into a hashtable. Hashtables are designed for constant-time insertions even when the sequence of inserted items is in no particular order.
Mergesort is one of several sorting algorithms with O(n log n) worst-case complexity, which is optimal if all you can do is compare two elements to see which is smaller. If you can do other operations, like index an array, O(n) sorting can be done with radixsort -- but it's almost certainly not worth your time to investigate this (especially as you may not even need to sort at all).
If you resize a hashtable by a constant factor when it gets full (e.g. doubling or tripling the size) then you maintain constant-time inserts. (If you resize by a constant amount, your inserts will degrade to linear time per insertion, or quadratic time over all n insertions.) This might seem wasteful of memory, but if your resize factor is k, then the proportion of wasted space will never be more than (k-1)/k (e.g. 50% for doubling). So there's never any asymptotic execution-time advantage to computing the exact size beforehand, although this may be some constant factor faster (or slower).
There are a variety of ways to handle hash collisions that trade off execution time versus maximum usable density in different ways.
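As a hedged sketch of the "resize by a constant factor" option (illustrative names; the table stores key pointers directly and error handling is omitted): an open-addressed table with linear probing that doubles whenever it becomes half full, which keeps insertion amortized constant time as described above.

#include <stdlib.h>
#include <string.h>

struct table {
    char **keys;     /* NULL = empty slot; caller keeps the strings alive */
    size_t cap;      /* always a power of two */
    size_t count;
};

static size_t hash_str(const char *s) {          /* djb2 string hash */
    size_t h = 5381;
    for (; *s; s++) h = h * 33 + (unsigned char)*s;
    return h;
}

static struct table table_make(size_t cap) {     /* cap: power of two >= 2 */
    struct table t = { calloc(cap, sizeof(char *)), cap, 0 };
    return t;
}

static void table_insert(struct table *t, char *key);

static void table_grow(struct table *t) {        /* double and rehash */
    struct table bigger = table_make(t->cap * 2);
    for (size_t i = 0; i < t->cap; i++)
        if (t->keys[i])
            table_insert(&bigger, t->keys[i]);
    free(t->keys);
    *t = bigger;
}

static void table_insert(struct table *t, char *key) {
    if (2 * (t->count + 1) > t->cap)             /* keep load factor <= 1/2 */
        table_grow(t);
    size_t i = hash_str(key) & (t->cap - 1);
    while (t->keys[i] != NULL) {
        if (strcmp(t->keys[i], key) == 0) return;  /* already present */
        i = (i + 1) & (t->cap - 1);                /* linear probing */
    }
    t->keys[i] = key;
    t->count++;
}

For the anagram task, the key inserted here could be the letter-sorted form of each word, with the table started at some small power of two such as 16.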

C: Storing up to a million entries in a hash table

I'm working on a project where efficiency is crucial. A hash table would be very helpful since I need to easily look up the memory address of a node based on a key. The only problem I foresee is that this hash table will need to handle up to 1 million entries. As I understand it, hash table buckets are usually linked lists so that they can handle multiple entries in the same bucket. It seems to me that with a million entries these lists would be way too slow. What is the common way of implementing something like this? Maybe swapping a standard linked list out for a skip list?
If you want a hash table with a million entries, normally you'd have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of the buckets will have zero or one items. You can, in principle, be very unlucky and get all items in one bucket - but you'd have to be even more unlucky than those people who seem to get struck by lightning every other day.
For hashtables that grow, the normal trick is to grow by a constant percentage - the usual textbook case being growth by doubling the hash-table size. You do this whenever the number of items in the hashtable reaches a certain proportion of the hashtable size, irrespective of how many buckets are actually being used. This gives amortized expected performance of O(1) for inserts, deletes and searches.
The linked list in each bucket of a hash-table is just a way of handling collisions - improbable in a per-operation sense, but over the life of a significant hash table, they do happen - especially as the hash-table gets more than half full.
Linked lists aren't the only way to handle collisions - there's a huge amount of lore about this topic. Walter Bright (developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his Dscript gained a significant performance boost relative to Javascript from this design choice.
He used simple (unbalanced) binary trees when I asked, so the worst-case performance was the same as for linked lists, but the key point I guess is that the binary tree handling code is simple, and the hash table itself makes the odds of building large unbalanced trees very small.
In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option may be to use splay trees for collision handling. But overall, this is a minor issue for a few library designers and a few true obsessives to worry about.
You lose all the advantages of a hash table if the per-bucket lists ever have more than a few entries. The usual way to make a hash table scale to millions of entries is to make the primary hash array resizable, so even with millions of entries, the bucket lists stay short.
You can use a Tree instead of a List in the individual "buckets". (AVL or similar)
EDIT: well, Skip List would do too. (and seems to be faster) - O(log n) is what you aim for.
The total number of entries does not matter, only the average number of entries per bucket (N / size of the hash table). Use a hash function with a larger range (for example, 20 bits, or even larger) to keep that average low.
Of course, this will take up more memory, but that's it, it's a common memory vs speed tradeoff.
Not sure if this will help you or not, but maybe: http://memcached.org/
If your keys have a normal distribution (that's a very big IF), then the expected number of insertions into the hashtable to exhaust all of its buckets is M*log M (natural log, base e), where M is the number of buckets.
I was surprised I couldn't find this easily online!
I have posted the derivation on my blog and verified it with code using rand(). It does seem to be a pretty good estimate.
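For reference, that estimate follows from the standard coupon-collector argument (a sketch, assuming each insertion is equally likely to land in any of the M buckets): once k - 1 buckets are occupied, the chance that the next insertion hits an empty bucket is (M - k + 1)/M, so the expected waits add up to

E[T] = \sum_{k=1}^{M} \frac{M}{M - k + 1} = M \cdot H_M \approx M \ln M

where H_M is the M-th harmonic number, roughly ln M + 0.577.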

Why does searching an index have logarithmic complexity?

Is an index not similar to a dictionary? If you have the key, you can immediately access it?
Apparently indexes are sometimes stored as B-Trees... why is that?
Dictionaries are not implicitly sorted, B-Trees are.
A B-Tree index can be used for ranged access, like this:
WHERE col1 BETWEEN value1 AND value2
or ordering, like this:
ORDER BY col1
You cannot immediately access a page in a B-Tree index: you need to traverse a chain of child pages, and the number of levels to traverse grows logarithmically with the number of rows.
Some databases support dictionary-type indexes as well, namely, HASH indexes, in which case the search time is constant. But such indexes cannot be used for ranged access or ordering.
Database Indices are usually (almost always) stored as B-Trees. And all balanced tree structures have O(log n) complexity for searching.
'Dictionary' is an 'Abstract Data Type' (ADT), ie it is a functional description that does not designate an implementation. Some dictionaries could use a Hashtable for O(1) lookup, others could use a tree and achieve O(log n).
The main reason a DB uses B-trees (over any other kind of tree) is that B-trees are self-balancing and are very 'shallow' (requiring little disk I/O)
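As a rough worked example (the fan-out of 500 is just an illustrative figure): a B-tree with branching factor b holds about b^h keys at height h, so

h \approx \lceil \log_b n \rceil, \qquad b = 500,\ n = 10^9 \;\Rightarrow\; h = \lceil \log_{500} 10^9 \rceil = 4

i.e. on the order of four page reads per lookup, which is why keeping the tree shallow matters so much when nodes live on disk.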
One of the only data structures you can access immediately with a key is a vector, which, for a massive amount of data, becomes inefficient when you start inserting and removing elements. It also needs contiguous memory allocation.
A hash can be efficient but needs more space and will end up with collisions.
A B-tree has a good balance between search performance and space.
If your only queries are equality tests then, it's true, dictionaries are a good choice since they will do lookups in amortized O(1) time. However, if you want to extend queries to involve range checks, e.g. (select * from students where age > 10), then suddenly your dictionaries lose their advantage completely. This is where tree-based indexes come in. With a tree-based index you can perform equality and range checks in logarithmic time.
There is one problem with naive tree structures: they get unbalanced. This means that after adding certain values to the index, the tree structure becomes lopsided (e.g. like a long line) and lookups start to take O(N) time again. This can be resolved by balancing your tree. The B-Tree is one such approach which also takes advantage of systems capable of doing large blocks of I/O, and so is most appropriate for databases.
You can achieve O(1) if you pre-allocate an array of N entries and hash the key to these N values.
But then, if you have more than N entries stored, there are collisions. So for each key in the array you have a list of values, and it's not exactly O(1) anymore: scanning such a list is O(m), where m is the average number of collisions.
Example with hash = n mod 3
0 --> [0,a] [3,b] ...
1 --> [1,a] [4,b] [7,b] ...
2 --> [2,a] [5,b] ...
At some point it becomes so bad that you spend more time traversing the list of values for a potential key than you would using another structure with O(log n) lookup time, where n is the total number of entries.
You could of course pick N so big that the array/hash would be faster than the B-Tree. But the array has a fixed pre-allocated size. So if N=1000 and you store 3 values, you've wasted 997 slots in the array.
So it's essentially a performance-space trade-off. For a small set of values, arrays and hashing are excellent. For a large set of values, B-trees are more efficient.
Hashes are the fastest lookup data structures, but have some problems:
a) they are not sorted
b) no matter how good the hash is, there will be collisions, which become problematic with lots of data
c) finding a hash value in a hash-indexed file takes a long time, so most of the time hashes only make sense for in-memory (RAM) data - which makes them unsuitable for databases, which most of the time cannot fit all data in RAM
Sorted trees address these issues, and B-tree operations in particular can be implemented efficiently using files. The only drawback is that they have slower lookup times than a hash structure.
No data structure is perfect in all cases; depending on the estimated size of the data and how you use it, one or the other is better.
A hash index (e.g. in MySQL and PostgreSQL) has constant complexity (O(1)) for search:
CREATE INDEX ... USING HASH

Data Structure to store billions of integers

What is the best data structure to store millions/billions of records (assume a record contains a name and an integer) in memory (RAM)?
Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority)? Is it a Patricia tree? Is there anything better than that?
The search key is an integer (say, a 32-bit random integer), and all records are in RAM (assuming that enough RAM is available).
In C, platform Linux..
Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search/delete records efficiently. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes or are they uniformly distributed? Only if they have common prefixes, a patricia tree is useful.
Do you look up indexes in order (gang lookup), or randomly? If everything is uniform, random and no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer where gang lookup is used, you might look into radix trees.
my educated guess is a B-Tree (but I could be wrong ...):
B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
Instead of a hash you can at least use a radix to get started.
For any specific problem, you can do much better than a btree, a hash table, or a patricia trie. Describe the problem a bit better, and we can suggest what might work
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If using a hash table, you want to pre-allocate the hashtable for the expected final size so it doesn't have to rehash.
We can use a bitwise trie, where each node branches on a single bit (0/1) of the key, to store the integer values. With this we can ensure that the depth of the tree is at most 32/64, so fetch time is effectively constant, and keys with common prefixes share nodes.
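A minimal sketch of that idea for 32-bit keys (illustrative names, allocation error handling omitted): each level consumes one bit of the key, so every operation touches at most 32 nodes.

#include <stdint.h>
#include <stdlib.h>

/* Bitwise (radix-2) trie node over 32-bit keys. */
struct bit_node {
    struct bit_node *child[2];
    void *record;                 /* user record, set at the leaf level */
};

static struct bit_node *bit_node_new(void) {
    return calloc(1, sizeof(struct bit_node));
}

/* Walk the key from the most significant bit, creating nodes as needed. */
static void bit_insert(struct bit_node *root, uint32_t key, void *record) {
    for (int bit = 31; bit >= 0; bit--) {
        int b = (key >> bit) & 1;
        if (root->child[b] == NULL)
            root->child[b] = bit_node_new();
        root = root->child[b];
    }
    root->record = record;
}

/* Same walk without creation; returns NULL if the key is absent. */
static void *bit_lookup(const struct bit_node *root, uint32_t key) {
    for (int bit = 31; bit >= 0; bit--) {
        root = root->child[(key >> bit) & 1];
        if (root == NULL)
            return NULL;
    }
    return root->record;
}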

Resources