Quicksort partition: why a random pivot?

Suppose I have two cases:
Case 1: I always choose the first element as the pivot. Here the worst case, O(n^2), occurs when the array is already sorted or reverse sorted.
Case 2: I choose a random element as the pivot. Here the worst case, O(n^2), occurs when the random pivot is always the maximum or minimum element of the subarray.
Can't I argue that, given a random array, P(O(n^2) in Case 1) = P(O(n^2) in Case 2)? Intuitively, P(sorted or reverse-sorted array) = P(random pivot is always the max or min element of the subarray).
If so, how is the second case any better, given that we spend extra effort selecting a random pivot? We only need the second case when the data follows a certain pattern. Am I right? Please enlighten me.

When all permutations of the input are equally likely, the probability of choosing a bad pivot every time is the same for both strategies (first or random). It would be the same for any strategy that makes no comparisons (middle, third, alternating between second-to-last and second...).
(This might be different for a strategy that compares elements, such as median-of-three.)
But the truth is that in practice the permutations aren't equiprobable at all; there is a strong bias toward nearly sorted sequences.
Said differently, when the input is well shuffled or when you choose the pivot randomly, you must be very unlucky to make a bad drawing every time, and the probability of the worst case is infinitesimal. For a sorted sequence the odds are quite different, as you are sure to lose every time!
As a side note, picking a random value does have a cost, and it is not negligible compared to the cost of partitioning a small sequence. This is why it pays to switch to a more straightforward sort for sequences of length L or less, and to tune the value of L for best performance.
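To make the cutoff idea concrete, here is a minimal sketch (illustrative code, not from the answer; the CUTOFF constant stands in for L and the value 16 is a guess to be tuned): a quicksort with a random pivot and a Lomuto-style partition that falls back to insertion sort on short subarrays.

import random

CUTOFF = 16  # hypothetical stand-in for L; tune it on real data

def insertion_sort(a, lo, hi):
    # Straight insertion sort on a[lo..hi] inclusive.
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(a, lo, hi)  # short runs: skip the pivot machinery
        return
    # Random pivot: move it to the end, then Lomuto partition.
    p = random.randint(lo, hi)
    a[p], a[hi] = a[hi], a[p]
    pivot, store = a[hi], lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    quicksort(a, lo, store - 1)
    quicksort(a, store + 1, hi)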

To avoid the worst case, you would have to choose the optimal pivot for each subdivision: the median element. If you use any shortcut method to select the pivot (random, first, median-of-three or whatever), the possibility of hitting the worst case remains. It's just a question of probabilities.
Certain inputs are likely to occur, at least in some applications, such as arrays that are already sorted or nearly sorted.
If worst-case behavior is a threat, it is good to at least mitigate it by preventing the likely inputs from triggering that worst case.
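As an illustration of the median-of-three shortcut mentioned above (a sketch, not from the answer): picking the median of the first, middle and last elements defuses sorted and reverse-sorted input, though crafted input can still trigger the worst case.

def median_of_three(a, lo, hi):
    # Median of the first, middle and last elements: cheap insurance
    # against sorted or reverse-sorted input, but not a guarantee.
    mid = (lo + hi) // 2
    return sorted([a[lo], a[mid], a[hi]])[1]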

By picking a predictable element, such as the first, you can easily hit the worst case. If a grain of randomness is added, the pattern will likely be broken and the actual running time of the sorting algorithm will be lower than O(N^2).
On a related note, picking the pivot at random is not the last word either. There are techniques, such as median of medians, that come with a proof that the worst-case running time is still O(N log N). That is a huge advantage over taking the first element as the pivot.
You can refer to this article for an implementation based on median of medians: Finding Kth Smallest Element in an Unsorted Array
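For a feel of how median of medians picks its pivot, here is a sketch of the selection step only (illustrative; the linked article has a full implementation). The returned value is guaranteed to lie in the middle 40% of the data, which is what the worst-case bound rests on.

def median_of_medians(a):
    # Split into groups of 5, take each group's median, then recurse
    # on those medians.
    if len(a) <= 5:
        return sorted(a)[len(a) // 2]
    groups = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    return median_of_medians(medians)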

We're not worried about the runtime for when we're given a randomly-generated array. We're worried about the runtime when the array is sorted or near-sorted, which is actually pretty common. We're also worried about the runtime when the array is generated by an adversary with elements specifically chosen to ruin our day.
If it were just about random input, picking the first element as the pivot would be fine.

Related

What is the best algorithm for detecting duplicate numbers in a small array?

What is the best algorithm for detecting duplicate numbers in an array: best in speed and memory, while avoiding overhead?
A small array like [5,9,13,3,2,5,6,7,1]. Note that 5 is duplicated.
After searching and reading about sorting algorithms, I realized that I would use one of these algorithms: Quick Sort, Insertion Sort or Merge Sort.
But actually I am really confused about what to use in my case, which is a small array.
Thanks in advance.
To be honest, with that size of array, you may as well choose the O(n^2) solution (checking every element against every other element).
You'll generally only need to worry about performance if/when the array gets larger. For small data sets like this, an 'inefficient' solution may well have found the duplicate before the sort phase of an efficient solution has even finished :-)
In other words, you can use something like this (here as runnable Python rather than the original pseudo-code):

def find_first_duplicate(nums):
    # Check every element against every later element: O(n^2) time, O(1) space.
    for idx1 in range(len(nums) - 1):
        for idx2 in range(idx1 + 1, len(nums)):
            if nums[idx1] == nums[idx2]:
                return nums[idx1]
    return None  # no duplicates found
This finds the first value in the array which has a duplicate.
If you want an exhaustive list of duplicates, then just add the duplicate value to another (initially empty) array (once only per value) and keep going.
You can instead sort it using any half-decent algorithm; for a data set of the size you're discussing, even a bubble sort would probably be adequate. Then you just process the sorted items sequentially, looking for runs of equal values, but it's probably overkill in your case.
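A sketch of that sort-then-scan idea (illustrative; it leans on Python's built-in sort rather than bubble sort):

def duplicates_by_sorting(nums):
    # After sorting, equal values sit next to each other.
    s = sorted(nums)
    dups = []
    for i in range(1, len(s)):
        if s[i] == s[i - 1] and s[i] not in dups:
            dups.append(s[i])
    return dups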
Two good approaches depend on whether or not you know the range from which the numbers are drawn.
Case 1: the range is known.
Suppose you know that all numbers are in the range [a, b), so the length of the range is l = b - a.
You can create an array A of length l filled with 0s, then iterate over the original array and, for each element e, increment the value of A[e-a] (we are effectively mapping the range onto [0, l)).
Once finished, you can iterate over A and find the duplicate numbers: if there exists an i such that A[i] is greater than 1, then i+a is a repeated number.
The same idea is behind counting sort, and it works fine for your problem too.
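A sketch of this counting approach (function and parameter names are made up):

def duplicates_in_range(nums, a, b):
    # One counter slot per possible value in [a, b): the counting-sort idea.
    counts = [0] * (b - a)
    for e in nums:
        counts[e - a] += 1
    return [i + a for i, c in enumerate(counts) if c > 1]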
Case 2: the range is not known.
Quite simple. Slightly modify the approach above: instead of an array, use a map whose keys are the numbers from your original array and whose values are the number of times you find them. At the end, iterate over the keys and report those that were found more than once.
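A sketch of the map-based variant (here using Python's collections.Counter as the map):

from collections import Counter

def duplicates_any_range(nums):
    # Map each value to its occurrence count; no range knowledge needed.
    counts = Counter(nums)
    return [v for v, c in counts.items() if c > 1]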
Note.
In both cases mentioned above, the complexity is O(N), and you cannot do better, since you have to visit all the stored values at least once.
Look at the first example: we iterate over two arrays, of lengths N and l <= N, so the complexity is at most 2N, that is, O(N).
The second example is indeed a bit more complex and depends on the implementation of the map, but for the sake of simplicity we can safely assume it is O(N).
In memory, you are constructing data structures whose sizes are proportional to the number of distinct values in the original array.
As usual, memory occupancy and performance drive the choice: the more you spend of the former, the better the latter, and vice versa. As suggested in another response, if you know that the array is small, you can safely rely on an O(N^2) algorithm that requires no extra memory at all.
Which is the best choice? Well, it depends on your problem; we cannot say.

Complexity for finding one of many elements in an array

The question is pretty much what the title says, with a slight variation. If I remember correctly, finding an entry in an array of size n has average-case complexity O(n).
I assume that is also the case if there is a fixed number of sought elements in the vector, of which we want to find any one.
But what if the number of entries, of which we still only try to find one, is in some way related to the size of the vector, i.e. grows with it?
I have such a case at hand, but I don't know the exact relation between array size and the number of searched-for entries; it might be linear, it might be logarithmic. Is the average case still O(n)?
I would be grateful for any insights.
edit: an example
array size: 100
array content: at each position, a number from 1-10, completely random which one.
what we seek: the first occurrence of "1"
From a naive point of view we should, on average, find an entry after 10 lookups with any kind of linear search (which we have to do, as the content is not sorted).
Since constant factors are usually omitted in big-O, does that mean we still say O(n) in time, even though on average only about n/10 lookups are needed?
It is O(n) anyway.
Think about finding 1 here:
[9,9,9,9,9,1]
If you're doing a linear search through the array, then the average time complexity of finding one of M elements in an array with N elements will be O(I) where I is the average index of the first of the sought elements. If the array is randomly ordered, then I will be O(N/M) on average, and so the time complexity will also be O(N/M) on average and O(N-M) in the worst case.
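To see the O(N/M) behavior empirically, here is a small simulation sketch (illustrative; function and parameter names are made up) that estimates the index of the first of M marked elements among N random positions:

import random

def avg_first_match_index(n, m, trials=10000):
    # The m targets land at m uniformly random positions among n slots;
    # average the index of the earliest one over many trials.
    total = 0
    for _ in range(trials):
        total += min(random.sample(range(n), m))
    return total / trials

# For n=100, m=10 this prints a value around 8, i.e. on the order
# of N/M, not N.
print(avg_first_match_index(100, 10))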
I am of two minds about this question.
First, for an unsorted array (which seems to be the case here), the average-case asymptotic complexity is certainly O(n).
Let's take an example.
We have n elements in the array, or better to say vector. The average case is a linear search, node by node, which takes about n/2 comparisons on average, i.e. O(n). If the number of sought elements grows, the nature of the complexity does not change, but the effect is clear: with m sought elements the scan stops earlier, after roughly (n-m)/2 comparisons on average, i.e. O(n-m).
So we find that as the size of the array (or vector) grows, the nature of the complexity does not change, although the number of comparisons required grows with it, being about n/2 in the average case.
Second, if the array or vector is sorted, then binary search has a worst case of order log(n+1), again dependent on n. The average number of comparisons also grows logarithmically, but the complexity class O(log n) does not change.

How to know if an array is sorted?

I already read this post but the answer didn't satisfy me: Check if Array is sorted in Log(N).
Imagine I have a seriously big array of over 1,000,000 doubles (positive and/or negative) and I want to know if the array is "sorted", while trying to avoid the maximum number of comparisons, because comparing doubles and floats takes too much time. Is it possible to use statistics for it? And if so:
Is it well regarded by real programmers?
Should I take samples?
How many samples should I take?
Should they be random, or in a sequence?
What error percentage is permitted to say "the array is sorted"?
Thanks.
That depends on your requirements. If you can say that 100 random samples out of 1,000,000 are enough to assume it's sorted, then so it is. But to be absolutely sure, you will always have to go through every single entry. Only you can answer this question, since only you know how certain you need to be about it being sorted.
This is a classic probability problem taught in high school. Consider this question:
In a batch of 8,000 clocks, 7% are defective. A random sample of 10 (without replacement) is selected and tested. If at least one is defective, the entire batch will be rejected. What is the probability that the batch will be rejected?
So you can take a number of random samples from your large array and check whether they are sorted, but you must note that you would need to know the probability that a sample is out of order. Since you don't have that information, a probabilistic approach won't work efficiently here.
(However, you could check 50% of the array and naively conclude that there is a 50% chance that it is sorted correctly.)
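For what it's worth, the clock example can be computed directly; a quick sketch (sampling without replacement, so a hypergeometric calculation):

from math import comb

# 8,000 clocks, 7% defective (560); sample 10 without replacement.
total, defective, sample = 8000, 560, 10
p_all_good = comb(total - defective, sample) / comb(total, sample)
print(1 - p_all_good)  # probability of rejection, roughly 0.52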
If you run a divide-and-conquer algorithm using multiprocessing (real parallelism, so only on multi-core CPUs), you can check whether an array is sorted in Log(N) time.
With GPU multiprocessing you can achieve Log(N) very easily, since modern graphics cards can run a few thousand threads in parallel.
Your question 5 is the question that you need to answer to determine the other answers. To ensure the array is perfectly sorted you must go through every element, because any one of them could be the one out of place.
The maximum number of comparisons to decide whether the array is sorted is N-1, because there are N-1 adjacent number pairs to compare. But for simplicity, we'll say N as it does not matter if we look at N or N+1 numbers.
Furthermore, it is unimportant where you start, so let's just start at the beginning.
Comparison #1 (A[0] vs. A[1]). If it fails, the array is unsorted. If it succeeds, good.
As we only compare, we can reduce this to the neighbors and whether the left one is smaller or equal (1) or not (0). So we can treat the array as a sequence of 0's and 1's, indicating whether two adjacent numbers are in order or not.
To calculate the error rate (or rather the probability of the array being sorted), we have to look at all combinations of our 0/1 sequence.
I would look at it like this: there are 2^n possible combinations of the sequence, of which only one corresponds to a sorted array (all positions are 1, indicating that each A[i] is less than or equal to A[i+1]).
Now this seems to be simple:
Initially the probability of the array being sorted is 1/2^n. Each successful comparison eliminates half of the remaining combinations, so after the first success the probability rises to 1/2^(n-1), after the second to 1/2^(n-2), and so on.
I'm not a mathematician, but it should be easy to calculate how many comparisons x are needed so that 1/2^(n-x) reaches your acceptable error rate.
Since every single element could be the one that is out of line, you have to run through all of them; hence your algorithm has runtime O(n).
If your understanding of "sorted" is less strict, you need to specify what exactly you mean by "sorted". Usually, "sorted" means that adjacent elements meet a less or less-or-equal condition.
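That adjacent-pair definition translates directly into the O(n) check; a minimal sketch:

def is_sorted(a):
    # One linear pass over adjacent pairs: n-1 comparisons.
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))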
Like everyone else says, the only way to be 100% sure that it is sorted is to run through every single element, which is O(N).
However, it seems to me that if you're so worried about it being sorted, then maybe having it sorted to begin with is more important than having the array elements stored in a contiguous portion of memory?
What I'm getting at is, you could use a map whose elements by definition follow a strict weak ordering. In other words, the elements in a map are always sorted. You could also use a set to achieve the same effect.
For example, std::map<int,double> collection; would allow you to almost use it like an array: collection[0]=3.0; std::cout<<collection[0]<<std::endl;. There are differences, of course, but if the sorting is so important then an array is the wrong choice for storing the data.
The old-fashioned way: print it out and see if it's in order. Really, if your sort is wrong you would probably see it soon. It's unlikely that you would see only a few misorderings when sorting 100+ things; whenever I deal with this, either the whole thing is completely off or it works.
As an example that you probably should not use but that demonstrates sampling size:
A statistically valid sample size can give you a reasonable estimate of sortedness. If you want to be 95% certain everything is sorted, you can do that by creating a list of truly random points to sample, perhaps ~1500.
Essentially this is completely pointless if a list that is out of order in one single place will break subsequent algorithms or data requirements.
If that is a problem, preprocess the list before your code runs, or use a really fast sort package in your code. Most sort packages also have a validation mode, where they simply tell you whether the list meets your sort criteria or not. Other suggestions, like parallelizing your check with threads, are great ideas.
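A sketch of such a spot check (illustrative; the ~1500 figure is the rule of thumb from the answer above, and the function assumes len(a) >= 2). Note that it can only ever prove the array unsorted, never prove it sorted:

import random

def probably_sorted(a, samples=1500):
    # Spot-check random adjacent pairs; a failure is proof of disorder,
    # while passing only gives statistical confidence.
    for _ in range(samples):
        i = random.randrange(len(a) - 1)
        if a[i] > a[i + 1]:
            return False
    return True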

Find the one non-repeating element in an array?

I have an array of n elements in which only one element is not repeated; all the other numbers occur more than once. There is no limit on the range of the numbers in the array.
Some solutions are:
Making use of a hash table, which gives linear time complexity but poor space complexity
Sorting the list with MergeSort in O(n log n) and then finding the element which doesn't repeat
Is there a better solution?
One general approach is to implement a bucketing technique (hashing is one such technique) to distribute the elements into different "buckets" by their identity and then find the bucket with the smallest size (1 in your case). This problem, I believe, is also known as the minority element problem. There will be as many buckets as there are unique elements in your set.
Doing this by hashing is problematic because of collisions and how your algorithm might handle that. Certain associative array approaches such as tries and extendable hashing don't seem to apply as they are better suited to strings.
One application of the above is to the Union-Find data structure. Your sets will be the buckets and you'll need to call MakeSet() and Find() for each element in your array for a cost of $O(\alpha(n))$ per call, where $\alpha(n)$ is the extremely slow-growing inverse Ackermann function. You can think of it as being effectively a constant.
You'll have to do a Union when an element already exists. With some changes to keep track of the set with minimum cardinality, this solution should work. The time complexity of this solution is $O(n\alpha(n))$.
Your problem also appears to be loosely related to the Element Uniqueness problem.
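The simplest realization of the bucketing idea is a hash map from value to count; a minimal sketch under that interpretation:

from collections import Counter

def find_unique(nums):
    # Bucket by value and return the bucket of size 1:
    # O(n) time, O(u) space for u distinct values.
    counts = Counter(nums)
    for value, count in counts.items():
        if count == 1:
            return value
    return None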
Try multi-pass scanning if you have strict space limitations.
Say the input has n elements and you can only hold m elements in memory. If you use a hash-table approach, in the worst case you need to handle n/2 unique numbers, so you want m > n/2. In case you don't have that big an m, you can partition the n elements into k = (max(input) - min(input)) / (2m) groups and scan the n input elements k times (in the worst case):
1st pass: only hash/count elements with value < min(input) + 2m, because in the range [min(input), min(input) + 2m) there are at most m unique elements you can handle. If you are lucky you already find the unique one; otherwise continue.
2nd pass: only operate on elements with value in the range [min(input) + 2m, min(input) + 4m),
and so on.
This way you compromise on time complexity, which becomes O(kn), but you get a space complexity bound of O(m).
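A rough sketch of those passes (illustrative; it assumes numeric values and reuses the window width 2m from the answer, so the table-size bound holds only under the answer's assumption about unique values per window):

def find_unique_multipass(nums, m):
    # Scan the input repeatedly; each pass only counts values inside a
    # window of width 2*m.
    start, stop = min(nums), max(nums)
    while start <= stop:
        counts = {}
        for x in nums:
            if start <= x < start + 2 * m:
                counts[x] = counts.get(x, 0) + 1
        for value, count in counts.items():
            if count == 1:
                return value  # the non-repeating element
        start += 2 * m
    return None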
Two ideas come to my mind:
Smoothsort may be a better alternative than the cited mergesort for your needs: it's O(1) in memory usage, O(n log n) in the worst case like merge sort, but O(n) in the best case.
Based on the (reverse) idea of the splay tree, you could make a type of tree that pushes the leaves toward the bottom once they are used (instead of upward, as in the splay tree). This would still give you an O(n log n) implementation of the sort, but with the advantage that finding the unique element becomes an O(1) step: it would be the root. The sorting approach costs O(n log n) + O(n), whereas this one would be O(n log n) + O(1).
Otherwise, as you stated, a hash-based solution (like a hash-implemented set) would give an O(n) algorithm (O(n) to insert and add a counting reference, O(n) to traverse the set and find the unique element), but you seemed to dislike the memory usage, though I don't know why. Memory is cheap these days...

Quicksort complexity when all the elements are the same?

I have an array of N numbers which are all the same. I am applying Quicksort to it.
What should the time complexity of the sort be in this case?
I googled around this question but did not find an exact explanation.
Any help would be appreciated.
This depends on the implementation of Quicksort. The traditional implementation, which partitions into 2 sections (< and >=), will be O(n*n) on identical input. While no swaps will necessarily occur, it will cause n recursive calls to be made, each of which needs to compare the pivot with n - recursionDepth elements; i.e. O(n*n) comparisons need to be made.
However, there is a simple variant which partitions into 3 sets (<, = and >). This variant has O(n) performance in this case: instead of choosing the pivot, swapping, and then recursing on 0 to pivotIndex-1 and pivotIndex+1 to n, it swaps everything equal to the pivot into the 'middle' partition (which, in the case of all-identical input, always means swapping with itself, i.e. a no-op). The call stack is then only 1 deep in this particular case: n comparisons and no swaps occur. I believe this variant has made its way into the standard library, on Linux at least.
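A sketch of that 3-way partition (Dijkstra-style; illustrative code, not the library version): with all-equal input the whole array falls into the middle section on the first pass, so the recursion ends immediately.

def quicksort3(a, lo=0, hi=None):
    # 3-way partition into <, = and > sections.
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[lo]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1  # equal keys stay in the middle, no swap
    quicksort3(a, lo, lt - 1)
    quicksort3(a, gt + 1, hi)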
The performance of quicksort depends on the pivot selection: the closer the chosen pivot is to the median element, the better quicksort performs.
In this specific case you're lucky: the pivot you select will always be a median, since all values are the same. The partition step will hence never have to swap elements, and the two pointers will meet exactly in the middle. The two subproblems will therefore be exactly half the size, giving you a perfect O(n log n).
To be a little more specific, this depends on how well the partition step is implemented. The loop invariant only needs to ensure that smaller elements end up in the left-hand subproblem and greater elements in the right-hand subproblem. There is no guarantee that a partition implementation never swaps equal elements, but it is always unnecessary work, so no clever implementation should do it: the left and right pointers will never detect an inversion with respect to the pivot (i.e. you will never hit the case where *left > pivot && *right < pivot), so the left pointer is incremented and the right pointer decremented at every step, and they finally meet in the middle, generating subproblems of size n/2.
It depends on the particular implementation.
If there is only one kind of comparison (≤ or <) to determine where the other elements go relative to the pivot, they will all go into one of the partitions, and you will get O(n^2) performance, since the problem size decreases by only 1 each step.
The algorithm listed here is an example (the accompanying illustrations are for a different algorithm).
If there are two kinds of comparisons, for example < for elements on the left and > for elements on the right, as in a two-pointer implementation, and if you take care to move the pointers in step, then you may get perfect O(n log n) performance, because the equal elements will be split evenly between the two partitions.
The illustrations in the link above use an algorithm which doesn't move the pointers in step, so you still get poor performance (look at the "Few unique" case).
So it depends on whether you have this special case in mind when implementing the algorithm.
Practical implementations often handle a broader special case: if there are no swaps in the partitioning step, they assume the data is nearly sorted, and use an insertion sort, which gives an even better O(n) in the case of all equal elements.
tobyodavies has provided the right solution: it handles this case and finishes in O(n) time when all the keys are equal.
It is the same partitioning as in the Dutch national flag problem:
http://en.wikipedia.org/wiki/Dutch_national_flag_problem
Sharing the code from Princeton:
http://algs4.cs.princeton.edu/23quicksort/Quick3way.java.html
If you implement the 2-way partitioning algorithm, then at every step the array will be halved. This is because when identical keys are encountered, the scan stops. As a result, at each step the partitioning element ends up at the center of the subarray, halving the array in every subsequent recursive call. This case is then similar to mergesort, which uses ~N lg N compares to sort an array of N elements. Ergo, for duplicate keys the traditional 2-way partitioning algorithm for Quicksort uses ~N lg N compares, i.e. it remains linearithmic.
Quicksort is implemented with a "partition" function and a "quicksort" function.
Basically, there are two standard ways of implementing Quicksort; the difference between them is only the "partition" function:
1. Lomuto
2. Hoare
With a partitioning algorithm such as the Lomuto partition scheme (even one that chooses good pivot values), quicksort exhibits poor performance for inputs that contain many repeated elements. The problem is clearly apparent when all the input elements are equal: at each recursion, the left partition is empty (no input values are less than the pivot), and the right partition has only decreased by one element (the pivot is removed). Consequently, the Lomuto partition scheme takes quadratic time to sort an array of equal values.
So this takes O(n^2) time with the Lomuto partition algorithm.
With the Hoare partition algorithm, an all-equal array is the best case for the partitioning: the pointers meet in the middle and the subproblems stay balanced, giving O(n log n) overall.
Reference: https://en.wikipedia.org/wiki/Quicksort
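For reference, a sketch of the Hoare scheme (a rendering of the classic formulation, not taken from the article): both scans stop on keys equal to the pivot, which is what keeps the split balanced on all-equal input.

def hoare_partition(a, lo, hi):
    # Both inner loops stop at keys equal to the pivot, so on all-equal
    # input the pointers advance one step per swap and meet in the middle.
    pivot = a[lo]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j  # recurse on a[lo..j] and a[j+1..hi]
        a[i], a[j] = a[j], a[i]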
