Why does algorithm performance get worse with HashSet.add? - hashset

I'm working with an algorithm that has to read a file with 1 million lines and store some information about this file. I found the HashSet structure, which adds, removes and finds any data in O(1) time. But when I execute the algorithm with the line that adds the data into the HashSet, the execution time becomes more than 4x worse. Does HashSet performance get worse when we insert too much data into it?

Different HashSet implementations can vary in performance. First of all, there is a need for either some kind of tree or a set of buckets, and both have their own performance cost. Theoretically, hash data structures are fast, but reality can be quite different. And O(1) only means that the execution time is independent of the number of elements; it does not mean the operation is free or fast.
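To see this concretely, here is a rough micro-benchmark sketch in Java (assuming the question is about Java's HashSet, which it doesn't actually say, and this is not a rigorous JMH benchmark). It just illustrates that every add call does real work (hashing the element plus a bucket lookup), even though that work does not grow with the number of elements already stored:

```java
import java.util.HashSet;
import java.util.Set;

public class HashSetAddCost {
    public static void main(String[] args) {
        int n = 1_000_000;
        // Pre-sizing avoids intermediate rehashing during the measurement.
        Set<String> seen = new HashSet<>(2 * n);

        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            seen.add("line-" + i); // hash the string, find the bucket, insert
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("%d adds in %.1f ms (about %.0f ns per add)%n",
                n, elapsed / 1e6, (double) elapsed / n);
    }
}
```

If the rest of the per-line work is very cheap, adding even a constant cost like this on every one of the million lines can easily multiply the total run time, which matches the slowdown you observed without the structure itself getting slower as it fills.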

Related

Why is Merge sort better for large arrays and Quick sort for small ones?

The only reason I see for using merge sort over quick sort is if the list was already (or mostly) sorted.
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
It would seem unintuitive to say that because of large or small data sets one is better than the other.
For example, quoting Geeksforgeeks article on that:
Merge sort can work well on any type of data sets irrespective of its size (either large or small).
whereas
The quick sort cannot work well with large datasets.
And next it says:
Merge sort is not in place because it requires additional memory space to store the auxiliary arrays.
whereas
The quick sort is in place as it doesn’t require any additional storage.
I understand that space complexity and time complexity are separate things. But it is still an extra step, and of course writing everything to a new array would take more time with large data sets.
As for the pivoting problem, the bigger the data set, the lower the chance of picking the lowest or highest item (unless, again, it's an almost sorted list).
So why is it considered that merge sort is a better way of sorting large data sets instead of quick sort?
Why is Merge sort better for large arrays and Quick sort for small ones?
It would seem unintuitive to say that because of large or small data sets one is better than the other.
Assuming the dataset fits in memory (not paged out), the issue is not the size of the dataset, but a worst-case pattern for a particular implementation of quick sort that results in O(n^2) time complexity. Quick sort can use median of medians to guarantee a worst-case time complexity of O(n log(n)), but that ends up making it significantly slower than merge sort. An alternative is to switch to heap sort if the level of recursion becomes too deep; this is known as introsort and is used in some libraries.
https://en.wikipedia.org/wiki/Median_of_medians
https://en.wikipedia.org/wiki/Introsort
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
There are variations of merge sort that don't require any extra storage for data, but they tend to be about 50+% slower than standard merge sort.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
Every element of a sub-array will be compared to the pivot element. As the number of equal elements increases, Lomuto partition scheme gets worse, while Hoare partition scheme gets better. With a lot of equal elements, Hoare partition scheme needlessly swaps equal elements, but the check to avoid the swaps generally costs more time than just swapping.
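To make this concrete, here is a minimal quicksort sketch using the Hoare partition scheme (middle element as pivot); this is an illustration only, not any particular library's implementation. Every element in the range is compared against the pivot, and two elements equal to the pivot can still end up being swapped with each other:

```java
import java.util.Arrays;

public class HoareQuickSort {

    static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        quickSort(a, lo, p);
        quickSort(a, p + 1, hi);
    }

    // Hoare partition: afterwards everything in a[lo..p] is <= everything in a[p+1..hi].
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1, j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);        // every element is compared to the pivot
            do { j--; } while (a[j] > pivot);
            if (i >= j) return j;
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp; // may swap two elements equal to the pivot
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 3, 5, 1, 5, 9, 5};
        quickSort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [1, 3, 5, 5, 5, 5, 9]
    }
}
```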
sorting an array of pointers to objects
Merge sort does more moves but fewer compares than quick sort. If sorting an array of pointers to objects, only the pointers are being moved, but comparing objects requires dereferencing the pointers and whatever else is needed to compare the objects. In this case and other cases where a compare takes more time than a move, merge sort is faster.
large datasets that don't fit in memory
For datasets too large to fit in memory, a memory-based sort is used to sort "chunks" of the dataset that fit into memory, which are then written to external storage. Then the "chunks" on external storage are merged using a k-way merge to produce a sorted dataset.
https://en.wikipedia.org/wiki/External_sorting
I was trying to figure out which sorting algorithm (merge/quick) has better time and memory efficiency as the input data size increases, so I wrote code that generates a list of random numbers and sorts the list with both algorithms. After that, the program generates 5 txt files that record the random numbers, with lengths of 1M, 2M, 3M, 4M and 5M (M stands for millions). I got the following results.
[Results omitted: tables and charts of execution time in seconds and memory usage in KB for each input size, with graphical interpretations.]
If you want the code, here is the GitHub repo: https://github.com/Nikodimos/Merge-and-Quick-sort-algorithm-using-php
In my scenario, merge sort becomes more efficient as the file size increases.
In addition to rcgldr's detailed response I would like to underscore some extra considerations:
large and small is quite relative: in many libraries, small arrays (with fewer than 30 to 60 elements) are usually sorted with insertion sort (a sketch follows these points). This algorithm is simpler and optimal if the array is already sorted, albeit with quadratic complexity in the worst case.
in addition to space and time complexities, stability is a feature that may be desirable, even necessary in some cases. Both Merge Sort and Insertion Sort are stable (elements that compare equal remain in the same relative order), whereas it is very difficult to achieve stability with Quick Sort.
As you mentioned, Quick Sort has a worst case time complexity of O(N^2), and libraries do not implement median of medians to curb this downside. Many just use median of 3 or median of 9, and some recurse naively on both branches, paving the way for stack overflow in the worst case. This is a major problem, as datasets can be crafted to exhibit worst case behavior, leading to denial of service attacks, slowing or even crashing servers. This problem was identified by Doug McIlroy in his famous 1999 paper A Killer Adversary for Quicksort. Implementations are available and attacks have been perpetrated using this technique (cf this discussion).
Almost sorted arrays are quite common in practice and neither Quick sort nor Merge sort treat them really efficiently. Libraries now use more advanced combinations of techniques such as Timsort to achieve better performance and stability.
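As a concrete illustration of the first point above, here is a sketch of a top-down merge sort that hands small subarrays to insertion sort. The cutoff of 32 is an assumption; real libraries pick thresholds roughly in the 16 to 64 element range:

```java
import java.util.Arrays;

public class MergeSortWithCutoff {
    static final int CUTOFF = 32;  // assumed threshold, not a library constant

    static void sort(int[] a) {
        mergeSort(a, new int[a.length], 0, a.length - 1);
    }

    static void mergeSort(int[] a, int[] aux, int lo, int hi) {
        if (hi - lo < CUTOFF) {        // small range: insertion sort is cheaper
            insertionSort(a, lo, hi);
            return;
        }
        int mid = lo + (hi - lo) / 2;
        mergeSort(a, aux, lo, mid);
        mergeSort(a, aux, mid + 1, hi);
        merge(a, aux, lo, mid, hi);
    }

    static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int key = a[i], j = i - 1;
            while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
    }

    static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
        System.arraycopy(a, lo, aux, lo, hi - lo + 1);
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid)              a[k] = aux[j++];
            else if (j > hi)          a[k] = aux[i++];
            else if (aux[j] < aux[i]) a[k] = aux[j++]; // take left on ties: stable
            else                      a[k] = aux[i++];
        }
    }

    public static void main(String[] args) {
        int[] a = {9, 4, 7, 1, 3, 8, 2, 6, 5, 0};
        sort(a);
        System.out.println(Arrays.toString(a)); // [0, 1, 2, ..., 9]
    }
}
```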

If insert into static array is Space-O(1), how does space complexity work?

Copy of static array is SO(N) (because the space required for this operation scales linearly with N)
Insert into static array is SO(1)
'because although it needs to COPY the first array and add the new element, after the copy it frees the space of the first array' - a quote from the source I'm learning about standard collections' TS complexity.
When we are dealing with time complexity, if the algorithm contains an O(N) operation, the whole time complexity of the algorithm is AT LEAST O(N).
I struggle to understand how exactly we measure space complexity. I thought we look at 'the difference in memory after the completion of the run of the algorithm to determine whether it scales with N' - that would make insert into a static array SO(1).
But then, by that method, if I have an algorithm that, over the course of its run, uses N! space to get a single value, and at the very end cleans up the memory allocated for those N! items, the algorithm would be Space-O(1).
Actually, every single algorithm that does not directly deal with entities that remain after its run would be O(1) (since we do not need the created entities after the algorithm, we can clean up the memory at the end).
Please help me understand the situation here. I know that in real-world complexity analysis we sometimes indulge in technical hypocrisy (like claiming the Big-O (worst case) of a get from a HashTable is Time-O(1), when it's actually O(N) but rare enough for us to treat as irrelevant, or that an insert at the end of a dynamic array is Time-O(1), when it's also O(N), but amortized analysis says it's likewise rare enough to call it Time-O(1)).
Is this one of those situations where insertion into a static array is actually Space-O(N), but we take it for granted that it's Space-O(1)? Or do I misunderstand how space complexity works?
Space complexity simply describes how memory consumption is related to the size of input data.
In case of SO(1), the memory consumption of a program does not depend on the size of the input. Examples of this are "grep", "sed" and similar tools that operate on streams; for a bigger dataset they simply run longer. In Java, for example, you have the SAX XML parser, which does not build a DOM model in memory but emits events for XML elements as it sees them; thus it can work with very large XML documents even on machines with limited memory.
In case of SO(N), memory consumption of a program grows linearly with the size of input. So if a program consumes 500 MB for 200 records and 700 MB for 400 records, then you know it will consume 1 GB for 700 records. Most programs actually fall in this category. For example, an email that is twice as long as some other email will consume twice as much space.
Programs that consume beyond SO(N) usually deal with complex data analysis, such as machine learning, AI, or data science. There you want to study relationships between individual records, so you sometimes build higher dimensional "cubes" and you get SO(N^2), SO(N^3) etc.
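A minimal sketch of the SO(1) versus SO(N) distinction, reading lines from standard input: the first method keeps only a single counter no matter how large the input is, while the second keeps every line and therefore grows linearly with it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class SpaceComplexityDemo {

    // SO(1): only a single counter lives in memory, regardless of input size.
    static long countLines(BufferedReader in) throws IOException {
        long count = 0;
        while (in.readLine() != null) count++;
        return count;
    }

    // SO(N): every line is retained, so memory grows linearly with the input.
    static List<String> collectLines(BufferedReader in) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) lines.add(line);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("lines: " + countLines(in));
    }
}
```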
So technically it's not correct to say that "insertion into a static array is SO(N)". The correct thing to say would be that "an algorithm for XYZ has space complexity SO(N)".
Time and space complexity are important when you are sizing your infrastructure. For example, you have tested your system on 50 users, but you know that in production there will be 10000 users, so you ask yourself "how much RAM, disk and CPU do I need to have in my server".
In many cases, time and space can be traded for each other. There's no better way to see this than in databases. For example, a table can have indexes which consume some additional space, but give you a huge speedup for looking up things because you don't have to search through all the records.
Another example of trading time for space would be cracking password hashes using rainbow tables. There you are essentially pre-computing some data which later saves you a lot of work, as you don't have to perform those expensive computations during cracking, but only look up the already calculated result.

Scanning file to find the exact HashTable size vs constantly resizing the array and rehashing, and other questions

So I am doing a project that requires me to find all the anagrams in a given file. Each file has one word on each line.
What I have done so far:
1.) sort the word (using Mergesort - (I think this is the best in the worst case.. right?))
2.) place into the hashtable using a hash function
3.) if there is a collision, move to the next available space further in the array (basically going down one by one until you see an empty spot in the hashtable). Is there a better way to do this? What I am doing is linear probing.
Problem:
When it runs out of space in the hash table, what do I do? I came up with two solutions: either scan the file before inputting anything into the hash table so it has one exact size, or keep resizing the array and rehashing as it gets more and more full. I don't know which one to choose. Any tips would be helpful.
A few suggestions:
Sorting is often a good idea, and I can think of a way to use it here, but there's no advantage to sorting items if all you do afterwards is put them into a hashtable. Hashtables are designed for constant-time insertions even when the sequence of inserted items is in no particular order.
Mergesort is one of several sorting algorithms with O(n log n) worst-case complexity, which is optimal if all you can do is compare two elements to see which is smaller. If you can do other operations, like index an array, O(n) sorting can be done with radix sort -- but it's almost certainly not worth your time to investigate this (especially as you may not even need to sort at all).
If you resize a hashtable by a constant factor when it gets full (e.g. doubling or tripling the size) then you maintain constant-time inserts. (If you resize by a constant amount, your inserts will degrade to linear time per insertion, or quadratic time over all n insertions.) This might seem wasteful of memory, but if your resize factor is k, then the proportion of wasted space will never be more than (k-1)/k (e.g. 50% for doubling). So there's never any asymptotic execution-time advantage to computing the exact size beforehand, although this may be some constant factor faster (or slower).
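To illustrate the resizing argument, here is a small sketch that just counts how many element copies n appends cause when the backing array doubles, versus when it grows by a fixed increment (the increment of 1024 is an arbitrary choice):

```java
public class ResizeCost {

    static long copiesWithGrowth(int n, boolean doubling) {
        long copies = 0;
        int capacity = 1;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {                 // array full: reallocate
                copies += size;                     // copy the existing elements over
                capacity = doubling ? capacity * 2 : capacity + 1024;
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("doubling:        " + copiesWithGrowth(n, true));  // stays below 2n
        System.out.println("fixed increment: " + copiesWithGrowth(n, false)); // roughly n^2 / 2048
    }
}
```

With doubling, the total number of copies stays below 2n, which is why each insert is amortized constant time; with a fixed increment, the total grows quadratically.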
There are a variety of ways to handle hash collisions that trade off execution time versus maximum usable density in different ways.
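Putting these suggestions together, here is one possible sketch of the overall approach in Java, letting HashMap do the constant-factor resizing internally: the sorted letters of each word act as the key, so words that are anagrams of one another land in the same group (the file name words.txt is just a placeholder):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnagramGroups {
    public static void main(String[] args) throws IOException {
        Map<String, List<String>> groups = new HashMap<>();

        for (String word : Files.readAllLines(Paths.get("words.txt"), StandardCharsets.UTF_8)) {
            char[] letters = word.trim().toLowerCase().toCharArray();
            Arrays.sort(letters);                    // canonical form of the word
            String key = new String(letters);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
        }

        for (List<String> group : groups.values()) {
            if (group.size() > 1) {                  // only print groups that contain anagrams
                System.out.println(group);
            }
        }
    }
}
```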

Time Complexity of Hash Tables in C

I'm fairly new to the concept of hash tables, and I've been reading up on different types of hash table lookup and insertion techniques.
I'm wondering what the difference is between the time complexities of linear probing, chaining, and quadratic probing?
I'm mainly interested in the insertion, deletion, and search of nodes in the hash table. So if I graph the system time per process (insert/search/delete) versus the process number, how would the graphs differ?
I'm guessing that:
- quadratic probing would be worst-case O(nlogn) or O(logn) for searching
- linear probing would be worst-case O(n) for search
- Not sure but I think O(n^2) for chaining
Could someone confirm this? Thanks!
It's actually surprisingly difficult to accurately analyze all of these different hashing schemes for a variety of reasons. First, unless you make very strong assumptions on your hash function, it is difficult to analyze the behavior of these hashing schemes accurately, because different types of hash functions can lead to different performances. Second, the interactions with processor caches mean that certain types of hash tables that are slightly worse in theory can actually outperform hash tables that are slightly better in theory because their access patterns are better.
If you assume that your hash function looks like a truly random function, and if you keep the load factor in the hash table to be at most a constant, all of these hashing schemes have expected O(1) lookup times. In other words, each scheme, on expectation, only requires you to do a constant number of lookups to find any particular element.
In theory, linear probing is a bit worse than quadratic hashing and chained hashing, because elements tend to cluster near one another unless the hash function has strong theoretical properties. However, in practice it can be extremely fast because of locality of reference: all of the lookups tend to be close to one another, so fewer cache misses occur. Quadratic probing has fewer collisions, but doesn't have as good locality. Chained hashing tends to have extremely few collisions, but tends to have poorer locality of reference because the chained elements are often not stored contiguously.
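To make the chaining variant concrete, here is a minimal sketch (in Java rather than C, purely as an illustration): each bucket holds a list of entries, so a collision simply lengthens the chain instead of probing other slots:

```java
import java.util.LinkedList;

public class ChainedHashTable {
    private static final int BUCKETS = 1024;          // fixed table size for simplicity

    private static final class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry>[] table = new LinkedList[BUCKETS];

    private int index(String key) {
        return (key.hashCode() & 0x7fffffff) % BUCKETS; // map hash to a bucket index
    }

    public void put(String key, String value) {
        int i = index(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        for (Entry e : table[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // update in place
        }
        table[i].add(new Entry(key, value));          // collision: just grow the chain
    }

    public String get(String key) {
        LinkedList<Entry> chain = table[index(key)];
        if (chain == null) return null;
        for (Entry e : chain) {
            if (e.key.equals(key)) return e.value;    // expected O(1) while chains stay short
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable();
        t.put("apple", "fruit");
        t.put("carrot", "vegetable");
        System.out.println(t.get("apple") + ", " + t.get("carrot"));
    }
}
```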
In the worst case, all of these data structures can take O(n) time to do a lookup. It's extremely unlikely for this to occur. In linear probing, this would require all the elements to be stored contiguously with no gaps and you would have to look up one of the first elements. With quadratic hashing, you'd have to have a very strange looking set of buckets in order to get this behavior. With chained hashing, your hash function would have to dump every single element into the same bucket to get the absolute worst-case behavior. All of these are exponentially unlikely.
In short, if you pick any of these data structures, you are unlikely to get seriously burned unless you have a very bad hash function. I would suggest using chained hashing as a default since it has good performance and doesn't hit worst-case behavior often. If you know you have a good hash function, or have a small hash table, then linear probing might be a good option.
Hope this helps!

Asymptotically Fast Associative Array with Low Memory Requirements

Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?
It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.
Did you try Patricia (a.k.a. crit-bit or radix) tries? I think they solve the worst-case space issue.
There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.
Dan Gusfield's book is the Bible of string algorithms.
I don't think there is a reason to be worried about the worst case, for two reasons:
You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.
If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.
That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table. Start with a small size and increase it as elements are inserted. Worst-case insertion would be c * m, when every hash table has to be reorganized due to growth, where c is the number of possible characters / unique atomic elements.
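Here is a sketch of that idea: a trie whose children are stored in a small per-node hash map rather than a fixed 256-slot array, so nodes with sparse fan-out cost very little space (the class names and the initial capacity of 4 are just illustrative choices):

```java
import java.util.HashMap;
import java.util.Map;

public class HashTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>(4); // starts small, grows as needed
        boolean terminal;                                        // true if a key ends at this node
    }

    private final Node root = new Node();

    public void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    public boolean contains(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        return node != null && node.terminal;
    }

    public static void main(String[] args) {
        HashTrie trie = new HashTrie();
        trie.insert("cat");
        trie.insert("car");
        System.out.println(trie.contains("car") + " " + trie.contains("cab")); // true false
    }
}
```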
In my experience there are three implementations that I think could meet your requirement:
HAT-Trie (combination between trie and hashtable)
JudyArray (compressed n-ary tree)
Double Array Tree
You can see the benchmark here. They are as fast as a hashtable, but with lower memory requirements and better worst-case behavior.
