Adapting quickselect for smallest k elements in an array

I know that I can get the Kth order statistic (i.e. the kth smallest number in an array) by using quickselect in almost linear time, but what if I needed the k smallest elements of an array?
The Wikipedia article has pseudocode for the single-element lookup, but not for the k smallest elements lookup.
How should quickselect be modified to attain it in linear time (if possible)?

I believe that after you use quickselect to find the k-th order statistic, you will find that the first k elements of the resulting array are the k smallest elements, only probably not sorted.
Moreover, quickselect actually partitions the array with respect to the k-th order statistic: all the elements before it are smaller than or equal to it, and all the elements after it are greater than or equal to it. This is easy to prove.
Note, for example, the documentation for C++'s std::nth_element:
The other elements are left without any specific order, except that
none of the elements preceding nth are greater than it, and none of
the elements following it are less.
If you need not just the k smallest elements, but the k smallest elements in sorted order, you can of course sort them after quickselect.
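For concreteness, here is a minimal quickselect sketch in C++ (Lomuto partition, random pivot; all names are illustrative) showing exactly that property: after selecting the element at index k-1, the first k slots of the array hold the k smallest values, just not in sorted order.

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>

// Partitions a[lo..hi] around a randomly chosen pivot and returns its final index.
int partition(std::vector<int>& a, int lo, int hi) {
    std::swap(a[lo + std::rand() % (hi - lo + 1)], a[hi]);  // move a random pivot to the end
    int pivot = a[hi], store = lo;
    for (int i = lo; i < hi; ++i)
        if (a[i] < pivot) std::swap(a[i], a[store++]);
    std::swap(a[store], a[hi]);
    return store;
}

// Places the element of rank k (0-based) at index k; smaller elements end up before it.
void quickselect(std::vector<int>& a, int lo, int hi, int k) {
    while (lo < hi) {
        int p = partition(a, lo, hi);
        if (p == k) return;
        if (p < k) lo = p + 1; else hi = p - 1;
    }
}

int main() {
    std::vector<int> a{9, 1, 8, 2, 7, 3, 6};
    int k = 3;
    quickselect(a, 0, (int)a.size() - 1, k - 1);  // select the k-th smallest (index k-1)
    // a[0..k-1] now holds the k smallest elements, in no particular order
    for (int i = 0; i < k; ++i) std::cout << a[i] << ' ';
}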

Actually, modifying quickselect is not needed. If I had an array (called arrayToSearch in this example) and I wanted the k smallest items, I'd do this:
int k = 10; // if you wanted the 10 smallest elements
int[] smallestItems = new int[k];
for (int i = 0; i < k; i++)
{
    smallestItems[i] = quickselect(i, arrayToSearch); // i-th smallest (0-indexed)
}
Edit: I was under the assumption that k would be a relatively small number, which would make the effective running time O(n). Without that assumption, this approach takes O(k*n), not linear time. My answer is easier to comprehend and applicable for most practical purposes; recursion.ninja's answer may be more technically correct, and therefore better for academic purposes.

Related

Index of the kth largest element in an array algorithm

I am looking for an algorithm that returns the index of the kth largest element in an array. I found many algorithms, but most of them return the list of the k largest elements (Extract K largest elements from array of N integers in O(N + K) time, Best way to retrieve K largest elements from large unsorted arrays?, ...).
In this case, only the index of the kth largest element is needed; the k largest elements themselves are not needed. As the array and k are large, I would like to avoid allocating an array (or another structure, e.g. a linked list) of dimension k, and the initial array must be left unchanged. What is (are) the most efficient algorithm(s)?
Finding the k'th largest element in an array cannot be done in less than O(k*n) (or O((n-k)*n)) time without modifying the input array or allocating more than O(1) additional space. If you do not permute the array, you can't do any better than brute force; if you do permute the array, you can't reverse the permutation without keeping extra information around to do it.
(A randomized selection algorithm can achieve linearithmic expected time, but cannot improve on the worst-case time.)
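For illustration, here is a rough C++ sketch of the brute force this implies: O(k*n) time, O(1) extra space, and the input array is never touched. Each pass finds the largest value strictly below the previous pass's value and counts its duplicates. The function name and signature are made up for the example.

#include <cstddef>
#include <vector>

// O(k*n) time, O(1) extra space, array left unchanged. Assumes k >= 1; returns
// a.size() if k exceeds the number of elements. Returns the index of the first
// occurrence of the kth largest value.
std::size_t indexOfKthLargest(const std::vector<int>& a, std::size_t k) {
    bool haveThreshold = false;
    int threshold = 0;                        // largest value already accounted for
    while (true) {
        bool found = false;
        int best = 0;
        std::size_t bestIndex = 0, count = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (haveThreshold && a[i] >= threshold) continue;  // ranks already counted
            if (!found || a[i] > best) {
                found = true;
                best = a[i];
                bestIndex = i;
                count = 1;
            } else if (a[i] == best) {
                ++count;                      // duplicate of the current candidate value
            }
        }
        if (!found) return a.size();          // fewer than k elements in total
        if (count >= k) return bestIndex;     // the kth largest has value `best`
        k -= count;                           // skip all duplicates of `best` and continue
        threshold = best;
        haveThreshold = true;
    }
}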

Given an array A1, A2 ... AN and K, count how many subarrays have inversion count greater than or equal to K

Q) Given an array A1, A2 ... AN and K, count how many subarrays have inversion count greater than or equal to K.
N <= 10^5
K <= N*(N-1)/2
So, this question I came across in an interview. I came up with the naive solution of forming all subarrays with two for loops (O(n^2)) and counting inversions in each subarray using modified merge sort, which is O(nlogn). This leads to a complexity of O(n^3logn), which I guess can be improved. Any leads on how I can improve it? Thanks!
You can solve it in O(nlogn) if I'm not wrong, using two moving pointers.
Start with the left pointer in the first element and move the right pointer until you have a subarray with >= K inversions. To do that, you can use any balanced binary search tree and every time you move the pointer to the right, count how many elements bigger than this one are already in the tree. Then you insert the element in the tree too.
When you hit the point in which you already have >= K inversions, you know that every longer subarray with the same starting element also satisfies the restriction, so you can add them all.
Then move the left pointer one position to the right and subtract the inversions of it (again, look in the tree for elements smaller than it). Now you can do the same as before again.
An amortized analysis easily shows that this is O(nlogn), as the two pointers only traverse the array once and each operation in the tree is O(logn).
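Here is a rough sketch of that idea in C++. It uses a Fenwick tree (BIT) over coordinate-compressed values in place of the balanced BST mentioned above, since it supports the same "how many window elements are greater/smaller than x" queries in O(logn); names are illustrative and K >= 1 is assumed.

#include <algorithm>
#include <vector>

// Fenwick tree used as a multiset of value-ranks, so we can ask how many
// elements currently in the window are smaller/greater than a given value.
struct Fenwick {
    std::vector<int> t;
    explicit Fenwick(int n) : t(n + 1, 0) {}
    void add(int i, int d) { for (++i; i < (int)t.size(); i += i & -i) t[i] += d; }
    int prefix(int i) const {                      // number of stored ranks <= i
        int s = 0;
        for (++i; i > 0; i -= i & -i) s += t[i];
        return s;
    }
};

// Counts subarrays with at least K inversions in O(n log n); assumes K >= 1.
long long countSubarraysWithAtLeastKInversions(std::vector<int> a, long long K) {
    int n = (int)a.size();
    std::vector<int> sorted_a = a;                 // coordinate-compress values to ranks 0..n-1
    std::sort(sorted_a.begin(), sorted_a.end());
    for (int& x : a)
        x = (int)(std::lower_bound(sorted_a.begin(), sorted_a.end(), x) - sorted_a.begin());

    Fenwick bit(n);
    long long answer = 0, inv = 0;
    int r = 0, inWindow = 0;                       // current window is a[l..r-1]
    for (int l = 0; l < n; ++l) {
        while (r < n && inv < K) {                 // extend the right end until >= K inversions
            inv += inWindow - bit.prefix(a[r]);    // window elements strictly greater than a[r]
            bit.add(a[r], 1);
            ++inWindow;
            ++r;
        }
        if (inv >= K)
            answer += n - r + 1;                   // a[l..r-1], a[l..r], ..., a[l..n-1] all qualify
        bit.add(a[l], -1);                         // drop a[l] before advancing the left pointer
        --inWindow;
        inv -= bit.prefix(a[l] - 1);               // inversions where a[l] was the larger, left element
    }
    return answer;
}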

What is the lower bound for the algorithm?

Consider an algorithm that gets an unsorted array of size n and a number k <= n, and prints the k smallest numbers in ascending order. What is the lower bound for the algorithm (for every k)?
1. Omega(n)
2. Omega(k*logn)
3. Omega(n*logk)
4. Omega(n*logn)
5. Both #1 and #2 are correct.
Now, from my understanding, if we want to find a lower bound for an algorithm we need to look at the worst case. If that's the case, then obviously the worst case is when k=n. We know that sorting an array is bounded by Omega(nlogn), so the right answer should be #4.
Unfortunately, I am wrong and the right answer is #5.
Why?
It can be done in O(n + klogk):
1. Run a selection algorithm to find the k-th smallest element - O(n).
2. Iterate over the array and collect the elements lower than or equal to it - O(n). Another iteration might be needed in case the array allows duplicates, but it is still done in O(n).
3. Lastly, sort these k elements - O(klogk).
It is easy to see this solution is optimal - you cannot beat the O(klogk) factor, because otherwise, by setting k=n, you could sort any array in better than O(nlogn); and at least a linear scan is a must to find the required elements to be printed.
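As a concrete sketch of this recipe in C++, std::nth_element can serve as the O(n) (average-case) selection step; it also leaves the k smallest elements in the first k positions, so only those need to be sorted afterwards. The helper name below is made up.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// O(n + k log k) on average: selection + partition via nth_element, then sort only the k smallest.
std::vector<int> kSmallestSorted(std::vector<int> a, std::size_t k) {
    if (k >= a.size()) { std::sort(a.begin(), a.end()); return a; }
    std::nth_element(a.begin(), a.begin() + k, a.end());   // selection + partition, O(n) on average
    a.resize(k);                                            // keep the k smallest (unsorted)
    std::sort(a.begin(), a.end());                          // O(k log k)
    return a;
}

int main() {
    for (int x : kSmallestSorted({7, 2, 9, 4, 1, 8, 3}, 3))
        std::cout << x << ' ';                              // prints "1 2 3"
}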
Let's try for linear time:
In order to find the k-th smallest element, we use "Randomized-Select", which has an average running time of O(n), and we use that element as the pivot.
Use the quicksort partition step to split the array into array[i] <= K and array[i] > K, where K is the pivot value. This takes O(n) time.
Take the unsorted left part array[i] <= K (which has k elements) and counting-sort it, which will obviously take O(k+K).
Finally the print operation will take O(k).
Total time = O(n)+O(k+K)+O(k) = O(n+k+K)
Here, k is the number of elements which are smaller than or equal to K.
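A sketch of that variant in C++, assuming small non-negative integers (a prerequisite for counting sort) and using std::nth_element as a stand-in for Randomized-Select plus the partition step; the function name is illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// K denotes the value of the k-th smallest element; every kept element lies in
// [0, K], so counting sort over that range costs O(k + K). Assumes non-negative
// integers and 1 <= k <= a.size().
std::vector<int> kSmallestViaCountingSort(std::vector<int> a, std::size_t k) {
    std::nth_element(a.begin(), a.begin() + (k - 1), a.end()); // selection + partition around the pivot
    int K = a[k - 1];                                 // value of the k-th smallest element
    a.resize(k);                                      // first k slots now hold the k smallest
    std::vector<std::size_t> count(K + 1, 0);         // counting sort over the value range [0, K]
    for (int x : a) ++count[x];
    std::size_t out = 0;
    for (int v = 0; v <= K; ++v)
        for (std::size_t c = count[v]; c > 0; --c) a[out++] = v;
    return a;
}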

Inserting unknown number of elements into dynamic array in linear time

(This question is inspired by deque::insert() at index?; I was surprised that it wasn't covered in my algorithms lecture, and I also didn't find it mentioned in another question here, or even in Wikipedia :). I think it might be of general interest and I will answer it myself ...)
Dynamic arrays are data structures that allow addition of elements at the end in amortized constant time O(1) (by doubling the size of the allocated memory each time it needs to grow, see Amortized time of dynamic array for a short analysis).
However, insertion of a single element in the middle of the array takes linear time O(n), since in the worst case (i.e. insertion at the first position) all other elements need to be shifted by one.
If I want to insert k elements at a specific index in the array, the naive approach of performing the insert operation k times would thus lead to a complexity of O(n*k) and, if k=O(n), to a quadratic complexity of O(n²).
If I know k in advance, the solution is quite easy: Expand the array if necessary (possibly reallocating space), shift the elements starting at the insertion point by k and simply copy the new elements.
But there might be situations, where I do not know the number of elements I want to insert in advance: For example I might get the elements from a stream-like interface, so I only get a flag when the last element is read.
Is there a way to insert multiple (k) elements, where k is not known in advance, into a dynamic array at consecutive positions in linear time?
In fact there is a way and it is quite simple:
First, append all k elements at the end of the array. Since appending one element takes amortized O(1) time, this is done in O(k) time.
Second, rotate the elements into place. Say you want to insert the elements at position pos. Then you need to rotate the subarray A[pos..n-1] (where n is the length of the array after appending) by k positions to the right (or n-pos-k positions to the left, which is equivalent). Rotation can be done in linear time by use of a reverse operation, as explained in Algorithm to rotate an array in linear time. Thus the time needed for the rotation is O(n).
Therefore the total time for the algorithm is O(k)+O(n)=O(n+k). If the number of elements to be inserted is on the order of n (k=O(n)), you get O(n+n)=O(2n)=O(n) and thus linear time.
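For example, with std::vector as the dynamic array, the two steps can be sketched as follows; std::rotate runs in linear time, and insertBlock and pos are made-up names.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Insert the range [first, last) into `a` at position `pos`:
// step 1 appends (amortized O(k)), step 2 rotates the block into place (O(n)).
template <typename T, typename It>
void insertBlock(std::vector<T>& a, std::size_t pos, It first, It last) {
    std::size_t oldSize = a.size();
    a.insert(a.end(), first, last);                        // step 1: append at the end
    std::rotate(a.begin() + pos, a.begin() + oldSize, a.end());  // step 2: rotate appended block to `pos`
}

int main() {
    std::vector<int> a{1, 2, 3, 7, 8};
    std::vector<int> newElements{4, 5, 6};
    insertBlock(a, 3, newElements.begin(), newElements.end());
    for (int x : a) std::cout << x << ' ';                 // prints "1 2 3 4 5 6 7 8"
}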
You could simply allocate a new array of length k+n and insert the desired elements linearly.
// the new elements end up right after position insertionIndex
T* newArr = new T[k + n];
for (int i = 0; i < k + n; i++)
    newArr[i] = i <= insertionIndex ? oldArr[i]                            // old elements up to the insertion point
              : i <= insertionIndex + k ? toInsert[i - insertionIndex - 1] // the k new elements
              : oldArr[i - k];                                             // remaining old elements, shifted by k
return newArr;
Each iteration takes constant time, and it runs k+n times, thus O(k+n) (or, O(n) if you so like).

Find the minimum number of elements required so that their sum equals or exceeds S

I know this can be done by sorting the array and taking the largest numbers first until the required condition is met. That would take at least nlog(n) sorting time.
Is there any improvement over nlog(n)?
We can assume all numbers are positive.
Here is an algorithm that is O(n + size(smallest subset) * log(n)). If the smallest subset is much smaller than the array, this will be O(n).
Read http://en.wikipedia.org/wiki/Heap_%28data_structure%29 if my description of the algorithm is unclear (it is light on details, but the details are all there).
Turn the array into a max-heap, so that the biggest element is available; this takes O(n) time.
Repeatedly extract the biggest element from the heap until their sum is large enough. This takes O(size(smallest subset) * log(n)).
This is almost certainly the answer they were hoping for, though not getting it shouldn't be a deal breaker.
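A minimal sketch of this in C++, using std::make_heap and std::pop_heap; the function name is made up and a solution is assumed to exist.

#include <algorithm>
#include <cstddef>
#include <vector>

// Build a max-heap in O(n), then pop the largest element until the running sum
// reaches S. Assumes positive numbers and that a solution exists.
std::size_t minElementsForSumHeap(std::vector<long long> a, long long s) {
    std::make_heap(a.begin(), a.end());       // O(n) heapify, max element at the front
    long long sum = 0;
    std::size_t count = 0;
    auto end = a.end();
    while (sum < s && end != a.begin()) {
        std::pop_heap(a.begin(), end);        // move the current max to *(end - 1), O(log n)
        --end;
        sum += *end;
        ++count;
    }
    return count;
}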
Edit: Here is another variant that is often faster, but can be slower.
Walk through elements, until the sum of the first few exceeds S. Store current_sum.
Copy those elements into an array.
Heapify that array such that the minimum is easy to find, remember the minimum.
For each remaining element in the main array:
    if min(in our heap) < element:
        insert element into heap
        increase current_sum by element
        while S + min(in our heap) < current_sum:
            current_sum -= min(in our heap)
            remove min from heap
If we get to reject most of the array without manipulating our heap, this can be up to twice as fast as the previous solution. But it is also possible to be slower, such as when the last element in the array happens to be bigger than S.
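A C++ sketch of this variant (illustrative names), with one small adjustment to the pseudocode above: small elements are also popped when the remaining sum equals S exactly, so the candidate set stays as small as possible.

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Keep a min-heap of the current candidate set; elements no larger than the heap
// minimum are rejected without touching the heap. Assumes positive numbers,
// S > 0, and that a solution exists.
std::size_t minElementsForSumVariant(const std::vector<long long>& a, long long s) {
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>> heap;    // min-heap of the candidate set
    long long sum = 0;
    std::size_t i = 0;
    for (; i < a.size() && sum < s; ++i) {                // take a prefix whose sum reaches S
        heap.push(a[i]);
        sum += a[i];
    }
    while (!heap.empty() && sum - heap.top() >= s) {      // shrink the initial prefix if possible
        sum -= heap.top();
        heap.pop();
    }
    for (; i < a.size(); ++i) {
        if (a[i] <= heap.top()) continue;                 // rejected without touching the heap
        heap.push(a[i]);
        sum += a[i];
        while (sum - heap.top() >= s) {                   // drop small elements while the sum still reaches S
            sum -= heap.top();
            heap.pop();
        }
    }
    return heap.size();
}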
Assuming the numbers are integers, you can improve upon the usual n lg(n) complexity of sorting because in this case we have the extra information that the values are between 0 and S (for our purposes, integers larger than S are the same as S).
Because the range of values is finite, you can use a non-comparative sorting algorithm such as Pigeonhole Sort or Radix Sort to go below n lg(n).
Note that these methods are dependent on some function of S, so if S gets large enough (and n stays small enough) you may be better off reverting to a comparative sort.
Here is an O(n) expected time solution to the problem. It's somewhat like Moron's idea but we don't throw out the work that our selection algorithm did in each step, and we start trying from an item potentially in the middle rather than using the repeated doubling approach.
Alternatively, it's really just quickselect with a little additional bookkeeping for the remaining sum.
First, it's clear that if you had the elements in sorted order, you could just pick the largest items first until you exceed the desired sum. Our solution is going to be like that, except we'll try as hard as we can not to discover ordering information, because sorting is slow.
You want to be able to determine whether a given value is the cut-off: if including that value and everything greater than it meets or exceeds S, but removing it drops you below S, then you are golden.
Here is the pseudocode; I didn't test it for edge cases, but it gets the idea across.
def Solve(arr, s):
    # We could get rid of the worst-case O(n^2) behavior (which basically never
    # happens) by selecting the median here deterministically, but in practice
    # the constant factor on the algorithm would be much worse.
    p = random_element(arr)
    left_arr, right_arr = partition(arr, p)
    # assume p is in neither left_arr nor right_arr
    right_sum = sum(right_arr)
    if right_sum + p >= s:
        if right_sum < s:
            # solved it, p forms the cut off
            return len(right_arr) + 1
        # took too much, at least we eliminated left_arr and p
        return Solve(right_arr, s)
    else:
        # didn't take enough yet, take all of right_arr and p, then recurse on left_arr
        return len(right_arr) + 1 + Solve(left_arr, s - right_sum - p)
One improvement (asymptotically) over Theta(nlogn) you can do is to get an O(n log K) time algorithm, where K is the required minimum number of elements.
Thus if K is constant, or say log n, this is better (asymptotically) than sorting. Of course if K is n^epsilon, then this is not better than Theta(n logn).
The way to do this is to use selection algorithms, which can tell you the ith largest element in O(n) time.
Now search for K by doubling: start with i=1 (just the largest element) and double i at each step until the sum of the i largest elements reaches S; a binary search over the last doubling interval then gives K exactly.
At each step you find the ith largest element with the selection algorithm, compute the sum of the i largest elements, and check whether it is at least S.
This way, you would run O(log K) runs of the selection algorithm (which is O(n)) for a total running time of O(n log K).
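Here is a rough C++ sketch of that plan, using std::nth_element as the O(n) selection step; the doubling phase brackets K and a binary search over the last doubling interval pins it down exactly. Names are illustrative; positive numbers and the existence of a solution are assumed.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Sum of the i largest elements, computed on a copy so the caller's array is untouched.
long long sumOfLargest(std::vector<long long> a, std::size_t i) {
    if (i >= a.size()) return std::accumulate(a.begin(), a.end(), 0LL);
    std::nth_element(a.begin(), a.begin() + i, a.end(), std::greater<long long>());
    return std::accumulate(a.begin(), a.begin() + i, 0LL);   // the i largest are now in front
}

// Smallest K such that the K largest elements sum to at least s.
std::size_t minElementsForSumSelect(const std::vector<long long>& a, long long s) {
    std::size_t hi = 1;
    while (sumOfLargest(a, hi) < s) hi *= 2;     // doubling phase: O(log K) selection runs
    std::size_t lo = hi / 2 + 1;                 // the previous bound failed, so K lies in (hi/2, hi]
    while (lo < hi) {                            // binary search phase: O(log K) more runs
        std::size_t mid = lo + (hi - lo) / 2;
        if (sumOfLargest(a, mid) >= s) hi = mid;
        else lo = mid + 1;
    }
    return hi;
}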
1. If you find some number >= S, then you are done: a single element suffices.
2. Otherwise, pigeon-hole sort the numbers (they are all < S).
3. Sum elements highest to lowest in the sorted order till you reach or exceed S.
