Sorting tuples with a divide-and-conquer algorithm inside a database

I watched a lecture explaining that, when the data we need to sort cannot fit in memory, the database engine uses a divide-and-conquer sorting algorithm: it splits the data set into separate runs and sorts each run.
In this algorithm the pages are sorted and written to disk, then the pages are merged by comparing values, with each output page written back to disk as it fills; eventually this merges two big pages (runs) that together contain all the tuples. My question is: at that last point, when the two pages being merged are each larger than the memory size, how is the merge handled?
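The short answer (consistent with the external-sorting answer further down) is that the merge is streaming: it only ever needs the current page of each input run plus one output page in memory, so the runs themselves can be arbitrarily larger than memory. A minimal sketch of that idea, merging two sorted runs of integers stored in text files (my own illustration with one value standing in for a page, not actual database engine code):

```c
#include <stdio.h>
#include <stdbool.h>

/* Merge two sorted runs on disk into one sorted output file while holding
 * only one current value per run in memory. A real engine would read and
 * write whole pages through I/O buffers, but the memory footprint is still
 * a constant number of pages, independent of run size.
 * Error handling is omitted for brevity. */
static void merge_two_runs(const char *run_a, const char *run_b, const char *out_path)
{
    FILE *a = fopen(run_a, "r"), *b = fopen(run_b, "r"), *out = fopen(out_path, "w");
    long va, vb;
    bool has_a = fscanf(a, "%ld", &va) == 1;
    bool has_b = fscanf(b, "%ld", &vb) == 1;

    while (has_a && has_b) {
        if (va <= vb) { fprintf(out, "%ld\n", va); has_a = fscanf(a, "%ld", &va) == 1; }
        else          { fprintf(out, "%ld\n", vb); has_b = fscanf(b, "%ld", &vb) == 1; }
    }
    while (has_a) { fprintf(out, "%ld\n", va); has_a = fscanf(a, "%ld", &va) == 1; }
    while (has_b) { fprintf(out, "%ld\n", vb); has_b = fscanf(b, "%ld", &vb) == 1; }

    fclose(a); fclose(b); fclose(out);
}
```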

Related

Why is Merge sort better for large arrays and Quick sort for small ones?

The only reason I see for using merge sort over quick sort is if the list was already (or mostly) sorted.
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
It would seem unintuitive to say that because of large or small data sets one is better than the other.
For example, quoting the GeeksforGeeks article on the subject:
Merge sort can work well on any type of data sets irrespective of its size (either large or small).
whereas
The quick sort cannot work well with large datasets.
And next it says:
Merge sort is not in place because it requires additional memory space to store the auxiliary arrays.
whereas
The quick sort is in place as it doesn’t require any additional storage.
I understand that space complexity and time complexity are separate things. But it is still an extra step, and of course writing everything to a new array takes more time with large data sets.
As for the pivoting problem, the bigger the data set, the lower the chance of picking the lowest or highest item (unless, again, it's an almost sorted list).
So why is it considered that merge sort is a better way of sorting large data sets instead of quick sort?
Why is Merge sort better for large arrays and Quick sort for small ones?
It would seem unintuitive to say that because of large or small data sets one is better than the other.
Assuming the dataset fits in memory (not paged out), the issue is not the size of the dataset, but a worst-case pattern for a particular implementation of quick sort that results in O(n²) time complexity. Quick sort can use median of medians to guarantee a worst-case time complexity of O(n log(n)), but that ends up making it significantly slower than merge sort. An alternative is to switch to heap sort if the level of recursion becomes too deep; this hybrid is known as introsort and is used in some libraries.
https://en.wikipedia.org/wiki/Median_of_medians
https://en.wikipedia.org/wiki/Introsort
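The introsort idea can be sketched roughly like this (a simplified illustration, not the code of any particular library; the middle-element pivot and a depth limit of about 2·log2(n) are my own choices):

```c
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Heap sort, used as the O(n log n) fallback when recursion gets too deep. */
static void sift_down(int *v, size_t root, size_t n)
{
    while (2 * root + 1 < n) {
        size_t child = 2 * root + 1;
        if (child + 1 < n && v[child] < v[child + 1]) child++;
        if (v[root] >= v[child]) return;
        swap_int(&v[root], &v[child]);
        root = child;
    }
}

static void heap_sort(int *v, size_t n)
{
    for (size_t i = n / 2; i-- > 0; ) sift_down(v, i, n);
    for (size_t end = n; end-- > 1; ) {
        swap_int(&v[0], &v[end]);
        sift_down(v, 0, end);
    }
}

/* Quick sort with a Hoare partition that gives up and switches to heap
 * sort once the recursion depth budget is exhausted, so the worst case
 * stays O(n log n). */
static void intro_rec(int *v, long lo, long hi, int depth_limit)
{
    while (hi > lo) {
        if (depth_limit-- == 0) { heap_sort(v + lo, (size_t)(hi - lo + 1)); return; }

        int pivot = v[lo + (hi - lo) / 2];
        long i = lo - 1, j = hi + 1;
        for (;;) {                               /* Hoare partition */
            do { i++; } while (v[i] < pivot);
            do { j--; } while (v[j] > pivot);
            if (i >= j) break;
            swap_int(&v[i], &v[j]);
        }
        intro_rec(v, lo, j, depth_limit);        /* recurse on the left part */
        lo = j + 1;                              /* iterate on the right part */
    }
}

void intro_sort(int *v, size_t n)
{
    int depth_limit = 0;
    for (size_t m = n; m > 1; m >>= 1) depth_limit += 2;   /* ~2*log2(n) */
    intro_rec(v, 0, (long)n - 1, depth_limit);
}
```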
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
There are variations of merge sort that don't require any extra storage for data, but they tend to be about 50+% slower than standard merge sort.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
Every element of a sub-array will be compared to the pivot element. As the number of equal elements increases, the Lomuto partition scheme gets worse, while the Hoare partition scheme gets better. With a lot of equal elements, the Hoare partition scheme needlessly swaps equal elements, but the check to avoid those swaps generally costs more time than just swapping.
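For reference, here is what a textbook Lomuto partition looks like, with the equal-elements behaviour noted in the comments (a generic sketch of mine, not taken from any particular library):

```c
/* Lomuto partition of v[lo..hi] around the value v[hi]. Elements equal to
 * the pivot never satisfy v[j] < pivot, so they all stay on the right-hand
 * side; with many duplicates the split degenerates toward 0 / n-1, which is
 * why Lomuto gets worse as equal elements increase. A Hoare partition stops
 * its scans on elements equal to the pivot and swaps them across, keeping
 * the split balanced at the cost of some "needless" swaps. */
static long lomuto_partition(int *v, long lo, long hi)
{
    int pivot = v[hi];
    long i = lo;
    for (long j = lo; j < hi; j++) {
        if (v[j] < pivot) {
            int t = v[i]; v[i] = v[j]; v[j] = t;
            i++;
        }
    }
    int t = v[i]; v[i] = v[hi]; v[hi] = t;       /* put the pivot in place */
    return i;                                    /* pivot's final position */
}
```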
sorting an array of pointers to objects
Merge sort does more moves but fewer compares than quick sort. If sorting an array of pointers to objects, only the pointers are being moved, but comparing objects requires dereferencing the pointers and doing whatever work is needed to compare the objects. In this case, and in other cases where a compare takes more time than a move, merge sort is faster.
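To make that setup concrete (the record type and its fields are hypothetical, used only for illustration): sorting an array of pointers only shuffles the pointers around, while every comparison has to dereference two objects.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; the payload is large but is never copied. */
struct record {
    char key[64];
    char payload[1024];
};

/* qsort hands the comparator pointers to the array elements, and the
 * elements here are themselves pointers, hence the double indirection.
 * Each compare dereferences two records (potential cache misses), while
 * the sort only ever moves pointer-sized values. */
static int cmp_record_ptr(const void *a, const void *b)
{
    const struct record *ra = *(struct record * const *)a;
    const struct record *rb = *(struct record * const *)b;
    return strcmp(ra->key, rb->key);
}

void sort_records(struct record **recs, size_t n)
{
    qsort(recs, n, sizeof *recs, cmp_record_ptr);
}
```

(The same observation applies whichever sort is used; the point above is that when the compare dominates, a sort that does fewer compares, such as merge sort, wins.)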
large datasets that don't fit in memory
For datasets too large to fit in memory, an in-memory sort is used to sort "chunks" of the dataset that do fit in memory, and each sorted chunk is written to external storage. The "chunks" on external storage are then merged using a k-way merge to produce a sorted dataset.
https://en.wikipedia.org/wiki/External_sorting
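A compressed sketch of that two-phase scheme, under heavy simplifying assumptions of mine (one integer per line of text, CHUNK values assumed to fit in memory, run_%d.txt file names, a linear scan over the run heads instead of a min-heap, and no error handling):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define CHUNK 1000000   /* how many values we pretend fit in memory */

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Phase 1: sort chunks in memory and write each one out as a sorted run. */
static int make_runs(const char *input)
{
    FILE *in = fopen(input, "r");
    long *buf = malloc(CHUNK * sizeof *buf);
    int runs = 0;
    size_t n;
    do {
        for (n = 0; n < CHUNK && fscanf(in, "%ld", &buf[n]) == 1; n++) ;
        if (n == 0) break;
        qsort(buf, n, sizeof *buf, cmp_long);
        char name[64];
        snprintf(name, sizeof name, "run_%d.txt", runs++);
        FILE *out = fopen(name, "w");
        for (size_t i = 0; i < n; i++) fprintf(out, "%ld\n", buf[i]);
        fclose(out);
    } while (n == CHUNK);
    free(buf);
    fclose(in);
    return runs;
}

/* Phase 2: k-way merge - keep one current value per run in memory and
 * repeatedly emit the smallest one. */
static void merge_runs(int runs, const char *output)
{
    FILE **f = malloc(runs * sizeof *f);
    long *head = malloc(runs * sizeof *head);
    bool *live = malloc(runs * sizeof *live);
    for (int r = 0; r < runs; r++) {
        char name[64];
        snprintf(name, sizeof name, "run_%d.txt", r);
        f[r] = fopen(name, "r");
        live[r] = fscanf(f[r], "%ld", &head[r]) == 1;
    }
    FILE *out = fopen(output, "w");
    for (;;) {
        int best = -1;
        for (int r = 0; r < runs; r++)
            if (live[r] && (best < 0 || head[r] < head[best])) best = r;
        if (best < 0) break;                      /* all runs exhausted */
        fprintf(out, "%ld\n", head[best]);
        live[best] = fscanf(f[best], "%ld", &head[best]) == 1;
    }
    fclose(out);
    for (int r = 0; r < runs; r++) fclose(f[r]);
    free(f); free(head); free(live);
}
```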
I was trying to figure out which sorting algorithm (merge sort or quick sort) has better time and memory efficiency as the input data size increases, so I wrote code that generates a list of random numbers and sorts the list with both algorithms. The program generates five txt files containing random numbers of length 1M, 2M, 3M, 4M and 5M (M stands for millions), and I got the following results.
(Results omitted: tables and charts of execution time in seconds and memory usage in KB for each input size.)
If you want the code, here is the GitHub repo: https://github.com/Nikodimos/Merge-and-Quick-sort-algorithm-using-php
In my scenario, merge sort became more efficient as the file size increased.
In addition to rcgldr's detailed response I would like to underscore some extra considerations:
large and small is quite relative: in many libraries, small arrays (with fewer than 30 to 60 elements) are usually sorted with insertion sort. This algorithm is simpler and optimal if the array is already sorted, albeit with quadratic complexity in the worst case (see the sketch after this list).
in addition to space and time complexities, stability is a feature that may be desirable, even necessary in some cases. Both Merge Sort and Insertion Sort are stable (elements that compare equal remain in the same relative order), whereas it is very difficult to achieve stability with Quick Sort.
As you mentioned, Quick Sort has a worst case time complexity of O(N²) and libraries do not implement median of medians to curb this downside. Many just use median of 3 or median of 9 and some recurse naively on both branches, paving the way for stack overflow in the worst case. This is a major problem as datasets can be crafted to exhibit worst case behavior, leading to denial of service attacks, slowing or even crashing servers. This problem was identified by Doug McIlroy in his famous 1999 paper A Killer Adversary for Quicksort. Implementations are available and attacks have been perpetrated using this technique (cf. this discussion).
Almost sorted arrays are quite common in practice and neither Quick sort nor Merge sort treat them really efficiently. Libraries now use more advanced combinations of techniques such as Timsort to achieve better performance and stability.
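The insertion sort referred to in the first point above is essentially the plain textbook version below (shown only as an illustration; it is also stable, which ties into the second point):

```c
#include <stddef.h>

/* Insertion sort: quadratic in the worst case, linear on already sorted
 * input, stable, and with very low constant overhead - which is why
 * libraries use it as the base case for small sub-arrays. */
static void insertion_sort(int *v, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = v[i];
        size_t j = i;
        /* Shift larger elements one slot to the right; on sorted input
         * this loop body never runs. */
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}
```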

Time complexity of sequentially scanning an array vs a linked list?

Since the elements of an array are stored contiguously in memory, I understand that sequentially scanning all the elements of an array would be much faster than in a linked list of same size. For the former you only have to increment some index variable and then read the value at that index whereas for LL you have to read pointers and then fetch data at non-contiguous memory addresses. My question is specifically how we would categorise these two cases in terms of time complexity?
For scanning an array, does performing n random accesses, i.e. n O(1) operations, mean that overall it becomes O(n)? In that case wouldn't both be O(n)?
Sorry if this question doesn't make sense, maybe my understanding of time complexities isn't so good.
You are correct that sequentially scanning a linked list or an array takes time O(n), and that it's much faster to scan an array than a linked list due to locality of reference.
How can this be? The issue has to do with what you're counting with that O(n). Typically, when doing an analysis of an algorithm we assume that looking up a location in memory takes time O(1), and since in both cases you're doing O(n) memory lookups the total time is O(n).
However, the assumption that all memory lookups take the same amount of time is not a very good one in practice, especially with linked structures. You sometimes see analyses performed that do this in a different way. We assume that, somewhere in the machine, there's a cache that can hold B elements at any one time. Every time we do a memory lookup, if it's in cache, it's (essentially) free, and if it's not in cache then we do some work to load that memory address - plus the contents of memory around that location - into cache. We then only care about how many times we have to load something into the cache, since that more accurately predicts the runtime.
In the case of a linked list, where cells can be scattered randomly throughout memory, we'd expect to do Θ(n) memory transfers when scanning a linked list, since we basically will never expect to have a linked list cell already in cache. However, with an array, every time we find an array element not in cache, we pull into cache all the adjacent memory locations, which then means the next few elements will definitely be in the cache. This means that only (roughly) every 1/B lookups will trigger a cache miss, so we expect to do Θ(n / B) memory transfers. This predicts theoretically what we see empirically - it's much faster to sequentially scan an array than a linked list.
So to summarize, this is really an issue of what you're measuring and how you measure it. If you just count memory accesses, both sequential accesses will require O(n) memory accesses and thus take O(n) time. However, if you just care about cache misses, then sequential access of a dynamic array requires Θ(n / B) transfers while a linked list scan requires Θ(n) transfers, which is why the linked list appears to be slower.
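A quick way to see this empirically is a sketch along these lines (the ten-million element count, the shuffle used to scatter the list nodes, and the use of clock() are arbitrary choices of mine, not part of the answer above):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { long value; struct node *next; };

int main(void)
{
    const size_t n = 10 * 1000 * 1000;

    /* Contiguous array. */
    long *arr = malloc(n * sizeof *arr);
    for (size_t i = 0; i < n; i++) arr[i] = (long)i;

    /* Linked list whose nodes are deliberately scattered: allocate all the
     * nodes, shuffle the pointer array, then chain them in shuffled order
     * so that following next pointers jumps around memory. */
    struct node **nodes = malloc(n * sizeof *nodes);
    for (size_t i = 0; i < n; i++) nodes[i] = malloc(sizeof **nodes);
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        struct node *t = nodes[i]; nodes[i] = nodes[j]; nodes[j] = t;
    }
    for (size_t i = 0; i < n; i++) {
        nodes[i]->value = (long)i;
        nodes[i]->next = (i + 1 < n) ? nodes[i + 1] : NULL;
    }

    long sum = 0;                 /* keeps the loops from being optimized away */
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++) sum += arr[i];
    clock_t t1 = clock();
    for (struct node *p = nodes[0]; p; p = p->next) sum += p->value;
    clock_t t2 = clock();

    printf("array scan: %.3fs, list scan: %.3fs (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}
```

Both loops perform O(n) memory accesses, yet the list scan is typically much slower, which is the Θ(n) versus Θ(n / B) cache-transfer distinction in practice.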
This sort of analysis is often used when designing data structures for databases. The B-tree (and its relative the B+-tree), which are used extensively in databases and file systems, are specifically tuned around the size of a disk page to reduce memory transfers. More recent work has been done to design "cache-oblivious" data structures that always take optimal advantage of the cache even without knowing its size.
Unfortunately, you misunderstood how these things work.
Sequentially scanning all array elements is O(n), n being the size of the array, since you visit each element: you calculate each address and fetch the data n times.
Sequentially scanning all linked list elements is O(n), n being the size of the linked list, since you visit each element by following the links.
Accessing one element of an array is O(1), since the access is one memory address calculation and one fetch.
Accessing one element of a linked list is O(n), n being the position you want to access, because you have to hop along each link until you reach the desired element.
Accessing the value at a certain index, let's say, 500, in an array is "immediate" (O(1)) while, with a linked list, you must iterate over 500 nodes to get the wanted one (O(n)).
Therefore, with an array, an index at the beginning or at the end of the container is accessible at the same speed, while with a linked list, the higher the index, the longer it takes to reach it.
Conversely, inserting a node in a linked list is easy and fast, while doing the same in an array is slower.
So the question becomes: which is the more common operation, accessing indices (reading, writing) or manipulating the container structure (inserting, removing)? The answer seems obvious, but there can be cases where it's not.

Linked list insertion/deletion efficiency

Traditionally, linked lists are recommended over arrays when we want to perform insertions/deletions at random locations. This is because with a linked list we just have to change the next (and, in a doubly linked list, previous) pointers of the adjacent nodes, whereas in arrays we have to shove numerous elements over to make space for the new element (in the case of insertion).
However, the process of finding the location of insertion/deletion in the case of a linked list is very costly (sequential search) compared to arrays (random access), especially when we have large data.
Does this factor significantly decrease the efficiency of insertion/deletion in linked lists over arrays? Or is the time required to shove the elements in the case of an array a bigger problem than the sequential access?
However, the process of finding the location of insertion/deletion in the case of a linked list is very costly (sequential search) compared to arrays (random access), especially when we have large data.
Random access doesn't help anything if you are searching for an element and don't know where it is, and if you do know where it is and have, say, a pointer or index to it, there's no longer a search involved to access the element, whether you're using linked lists or arrays. The only case where random access helps here is if the array is sorted, in which case it enables a binary search.
Does this factor significantly decrease the efficiency of insertion/deletion in linked lists over arrays? Or is the time required to shove the elements in the case of an array a bigger problem than the sequential access?
Generally not, at least with unordered sequences, since, again, both arrays and linked lists require a linear-time search to find an element. And if people need to search frequently in their critical paths for non-trivial input sizes, they often use hash tables, balanced binary trees, tries, or something of that sort instead.
Often arrays are preferred over linked lists in a lot of performance-critical fields for reasons that don't relate to algorithmic complexity. It's because arrays are guaranteed to contiguously store their elements. That provides very good locality of reference for sequential processing.
There are also ways to remove from arrays in constant-time. As one example, if you want to remove the nth element from an array in constant-time, just swap it with the last element in the array and remove the last one in constant-time. You don't necessarily have to shuffle all the elements over to close a gap if you're allowed to reorder the elements.
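That swap-with-last removal is essentially the following (a sketch for a plain int array; it assumes 0 <= i < *n and, as noted above, it does not preserve the order of the remaining elements):

```c
#include <stddef.h>

/* Remove element i from v[0..*n-1] in O(1) by overwriting it with the last
 * element and shrinking the logical size - no shifting of elements. */
static void unordered_remove(int *v, size_t *n, size_t i)
{
    v[i] = v[*n - 1];
    (*n)--;
}
```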
Linked lists may or may not store their nodes contiguously. They often become a lot more useful in performance-critical contexts if they do, like if they store their nodes in an array (either through an array-based container or allocator). Otherwise traversing them can lead to cache misses galore with potentially a cache miss for every single node being accessed.
the process of finding the location of insertion/deletion in the case of a linked list is very costly (sequential search) as compared to arrays (random access)
The comparison is wrong, since you are comparing the efficiency of the insertion/deletion operations themselves. Instead, compare these two factors:
Sequential search in a linked list, which has time complexity O(n).
Copying array elements in order to shove them over, which may require copying up to n array elements (see the sketch after this answer).
In an array: if the underlying element type is POD, the implementation can just realloc; if not, it must move the elements with the object's operator=.
So you can see that not everything is in favor of array usage. A linked list obviates the need to copy the same data again and again.
especially when we have large data.
That would mean more array elements to be copied while shoving.
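For reference, the "shove" both sides are talking about boils down to something like this for a plain (POD) element type; for non-POD C++ objects the single memmove would instead be a loop of operator= / move assignments, as noted above:

```c
#include <stddef.h>
#include <string.h>

/* Insert value at position pos in v[0..*n-1]; capacity is assumed to be
 * sufficient. The memmove is the up-to-n element copy being discussed. */
static void array_insert(int *v, size_t *n, size_t pos, int value)
{
    memmove(&v[pos + 1], &v[pos], (*n - pos) * sizeof v[0]);
    v[pos] = value;
    (*n)++;
}
```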

Why does Lucene use arrays instead of hash tables for its inverted index?

I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?
Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer O(log N) access and (gasp) O(N) insertion, although merging 2 sorted arrays is the same complexity as merging 2 hash tables.
The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).
So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?
Well, I will speculate here (should probably be a comment - but it's going to be too long).
A HashMap is in general a fast look-up structure with search time O(1) - meaning it's constant. But that is the average case; since (at least in Java) a HashMap may use TreeNodes within a bucket, the search is O(log n) inside that bucket. And even if we treat their search complexity as O(1), that does not mean they are equal time-wise; it just means the cost is constant for each separate data structure.
Memory: indeed - I will give an example here. In short, storing 15_000_000 entries would require a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives instead of objects.
Putting entries into a HashMap (usually) requires all the keys to be re-hashed, which can be a significant performance hit, since they all potentially have to move to different locations.
Probably one extra point here - range searches, which would probably require some kind of TreeMap, whereas arrays are much better suited for them. I'm thinking about partitioning an index (maybe they do it internally).
I have the same idea as you - arrays are usually contiguous memory, and probably much easier for the CPU to prefetch.
And the last point: if you put me in their shoes, I would have started with a HashMap first... I am sure there are compelling reasons for their decision. I wonder if they have actual tests that prove this choice.
I was thinking of the reasoning behind it. Just thought of one use-case that was important in the context of text search. I could be totally wrong :)
Why sorted array and not Dictionary?
Yes, it performs well on range queries, but IMO Lucene was mainly built for text searches. Now imagine you were to run a prefix-based query, e.g. country:Ind*: with a HashMap/Dictionary you would need to scan the whole table, whereas with a sorted array this becomes log(n).
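A sketch of why the sorted array wins for country:Ind*-style queries: one binary search finds where the prefix range starts, and every matching term is contiguous from there (the term array and function names below are illustrative, not Lucene's actual term dictionary API):

```c
#include <stddef.h>
#include <string.h>

/* First index in the sorted term array whose entry is >= prefix. */
static size_t lower_bound_prefix(const char **terms, size_t n, const char *prefix)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(terms[mid], prefix) < 0) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* All terms sharing the prefix form one contiguous block that starts at
 * lower_bound_prefix(); finding it is O(log n), whereas a hash table would
 * have to visit every key. */
static void for_each_prefix_match(const char **terms, size_t n, const char *prefix,
                                  void (*visit)(const char *term))
{
    size_t len = strlen(prefix);
    for (size_t i = lower_bound_prefix(terms, n, prefix); i < n; i++) {
        if (strncmp(terms[i], prefix, len) != 0) break;
        visit(terms[i]);
    }
}
```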
Since we have a sorted array, it would be inefficient to update the array in place. Hence, in Lucene, segments (the inverted index resides in segments) are immutable.

Best suitable sorting algorithm

I have a hashtable that may contain around 1-5 million records. I need to iterate over it to select some of the entries and then sort them in some order. I was planning to use a linked list to maintain the list of pointers to the hashtable entries that I have to sort. But with a linked list, the only good sorting option I came across is merge sort. Considering that the list may contain around 5 million records, should merge sort be used? I have no restriction to use only a linked list to maintain the list of pointers; I can also use an array so that I can use heap sort. But deciding on the size of this array would be challenging, considering that this complete operation is pretty frequent and different instances of it can run in parallel. Also, the number of entries filtered out of the hashtable for sorting can vary from 1 to almost all the records in the hashtable. Can someone suggest what approach would best fit here?
Try the simplest approach first:
Implement a typical dynamic array, using realloc() for growth, and perhaps using the typical double-allocation-when-growing scheme. Growing to a million elements will then take about 20 re-allocations.
Sort the array with qsort().
Then profile that, and see where it hurts. If you're not memory-sensitive, increase the initial allocation for the array.
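A minimal sketch of that suggestion - a doubling dynamic array of entry pointers handed to qsort(); the entry type and the comparator are placeholders for whatever the hash table actually stores:

```c
#include <stdlib.h>

/* Growable array of pointers to hash-table entries. */
struct ptr_vec {
    void **data;
    size_t len, cap;
};

/* Append a pointer, doubling the capacity with realloc() when full, so
 * growing to a million elements costs on the order of 20 re-allocations. */
static int ptr_vec_push(struct ptr_vec *v, void *p)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 1024;
        void **d = realloc(v->data, new_cap * sizeof *d);
        if (!d) return -1;            /* out of memory; caller decides what to do */
        v->data = d;
        v->cap = new_cap;
    }
    v->data[v->len++] = p;
    return 0;
}

/* Sort the collected pointers. Note that qsort passes the comparator
 * pointers to the array slots (void** values), so cmp must dereference
 * once before it can look at the actual entry. */
static void ptr_vec_sort(struct ptr_vec *v, int (*cmp)(const void *, const void *))
{
    qsort(v->data, v->len, sizeof v->data[0], cmp);
}
```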
