Kth smallest in stream of numbers - arrays

We are given a stream of numbers and Q queries.
At each query, we are given a number k.
We need to find the kth smallest number at that point of the stream.
How to approach this problem?
The total size of the stream is < 10^5.
1 < number < 10^9.
I tried a linked list, but finding the right position is time-consuming, and with an array, insertion is time-consuming.

You can use some kind of search tree. There are many different kinds of search trees, but all the common ones allow insertion in O(log n), and when augmented with subtree sizes (an order-statistics tree) they allow finding the kth element in O(log n) too.
If the stream is too long to keep all the numbers in memory and you also know an upper bound on k, you can prune the tree by keeping only a number of elements equal to that upper bound.
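For illustration, a minimal Python sketch; sortedcontainers.SortedList (a third-party package, assumed here) stands in for an order-statistics tree, giving ordered insertion and kth-smallest lookup by index. The event format is hypothetical:

from sortedcontainers import SortedList  # pip install sortedcontainers

def process(events):
    # Each event is ("push", value) or ("query", k); k is 1-based.
    seen = SortedList()
    answers = []
    for kind, x in events:
        if kind == "push":
            seen.add(x)                  # ordered insert
        else:
            answers.append(seen[x - 1])  # kth smallest so far
    return answers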

You can use a max heap with size=k.
Insert elements until the heap's size reaches k. After that, push each new element and then pop the heap's root, so the size stays at k. Extracting the root makes sense because at that moment the heap holds k+1 elements and the root is the largest of them, so at least k elements are smaller than or equal to it.
When you have finished iterating over the stream, the root of the heap is the kth smallest element, because the heap holds the k smallest elements and the root is the largest among them.
As the heap's size is k, the time complexity is O(n lg k), which can be a bit better than O(n lg n). And the implementation is much easier.
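A minimal sketch with Python's heapq (a min-heap, so values are negated to get max-heap behavior); the function and parameter names are mine:

import heapq

def kth_smallest(stream, k):
    # Max-heap of the k smallest values seen so far (negated for heapq).
    # Assumes the stream yields at least k elements.
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:               # beats the current kth smallest
            heapq.heapreplace(heap, -x)  # push + pop in one O(log k) step
    return -heap[0]                      # largest of the k smallest

Skipping the push entirely when the new element is not smaller than the root is a common micro-optimization over the push-then-pop described above; the result is the same.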

Related

Index of the kth largest element in an array

I am looking for an algorithm that returns the index of the kth largest element in an array. I found many algorithms, but most of them return the list of the k largest elements (Extract K largest elements from array of N integers in O(N + K) time, Best way to retrieve K largest elements from large unsorted arrays?, ...).
In this case, only the index of the kth largest element is needed; the full list of the k largest elements is not. As the array and k are large, I would like to avoid allocating an array (or other structure, e.g. a linked list) of dimension k, and the initial array must remain unchanged. What is (are) the most efficient algorithm(s)?
Finding the kth largest element in an array cannot be done in less than O(k*n) (or O((n-k)*n)) time without modifying the input array or allocating more than O(1) additional space. If you do not permute the array, you can't do any better than brute force; if you do permute the array, you can't reverse the permutation without keeping extra information around to do it.
(A randomized selection algorithm can achieve linearithmic expected time, but cannot improve on the worst-case time.)
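A sketch of that brute-force bound under exactly those constraints (read-only array, O(1) extra space, O(k*n) time); the function name is mine and it assumes 1 <= k <= len(a):

def index_of_kth_largest(a, k):
    # Repeatedly find the next-largest element without modifying a.
    # Ties are broken by index so duplicates are handled correctly.
    prev_val, prev_idx = None, -1
    for _ in range(k):
        best = None
        for i, x in enumerate(a):
            # Skip anything already counted: strictly larger values,
            # or equal values at an index we have already passed.
            if prev_val is not None and (x > prev_val or (x == prev_val and i <= prev_idx)):
                continue
            if best is None or x > a[best]:
                best = i
        prev_val, prev_idx = a[best], best
    return prev_idx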

Time and Space Complexity of top k frequent elements in an array

There is a small confusion regarding the time and space complexity for the given problem:
Given an array of size N, return a list of the top K frequent elements.
Based on the most popular solution:
Use a HashMap of size K with the count of each entry as value.
Build a MaxHeap of size K by traversing the HashMap generated above.
Pop the elements in the MaxHeap into a list and return the list.
K being the number of unique elements in the input.
The space and time complexity is O(K) and O(K*log(K)) respectively.
Now the confusion starts here. We know we are dealing with worst-case complexity in the above analysis. So the worst value K can take is N, when all the elements in the array are unique.
Hence K <= N. Can O(K) thereby be represented as O(N)?
Shouldn't the space and time complexity then be O(N) and O(N*log(N)) for the above problem?
I know this is a technicality, but it's been bothering me for a while. Please advise.
Yes, you are right: since K <= N, the time complexity for the hashmap part should be stated as O(N).
The heap, however, only ever holds K elements, so that part costs O(K*log(K)). The overall time is therefore O(N + K*log(K)), which only degenerates to O(N*log(N)) in the worst case, when every element is unique and K = N.
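For concreteness, a Python sketch of the described solution (Counter plays the HashMap, and a heapified list the MaxHeap); here U is the number of unique elements, which the question calls K:

from collections import Counter
import heapq

def top_k_frequent(nums, k):
    # Assumes k <= number of unique elements.
    counts = Counter(nums)                        # O(N) time, O(U) space
    heap = [(-c, val) for val, c in counts.items()]
    heapq.heapify(heap)                           # max-heap by count, O(U)
    return [heapq.heappop(heap)[1] for _ in range(k)]  # O(k log U)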

Sort an n-element array so the first k elements are the lowest, in increasing order (in-place algorithm)

This is a homework question that I am stuck on.
I need to sort an n-element array so that the first k elements are the lowest and are in increasing order. For k <= n/log(n), the algorithm should be O(n).
My solutions:
A simple solution that I thought of is to heapify the array (O(n)), then delete the k elements one by one, shifting the starting index of the heap/array from 0 to 1, 2, 3 (and so on, all the way to k). This would be O(n + k*lg(n) + k*n) = O(kn + k*lg(n)). For the given condition on k, it would be O(n^2/log(n) + n).
Another possible implementation would be to use radix sort, which would be O(n) but I have a feeling this is not the right solution because I would be sorting the entire array and they only asked to sort k elements.
You don't have to give me the answer, just a hint would be helpful.
I like your heap idea. I actually think that it would work in the time bounds you listed and that there's a bit of a glitch in your analysis.
Suppose you do the following: build an in-place min-heap in your array, then dequeue the minimum k elements, leaving the remaining n - k elements where they are in the array. Each dequeue swaps the current minimum to the end of the shrinking heap, so when you are done the k smallest elements are stored at the back of the array in descending order (the smallest of all sits in the last slot), and the n - k remaining elements are at the front, in heap order.

If you're having trouble seeing this, think about how heapsort works: after k dequeues from a max-heap, the k largest elements sit at the back in ascending order, and the remaining elements are heap-ordered at the front. Here we've exchanged the max-heap for a min-heap, hence the mirrored ordering. As a result, if you then reverse the array at the end, you should have the k smallest elements in ascending order at the front, followed by the n - k remaining elements.
This will correctly find the k smallest elements, and the runtime is determined as follows:
Cost of the heapify: O(n)
Cost of k dequeues: O(k log n)
Cost of reversing the array: O(n)
Total cost: O(n + k log n)
Now, suppose that k ≤ n / log n. Then the runtime is
O(n + k log n) = O(n + (n / log n) log n) = O(n)
So you're done! The algorithm works just fine. Plus, it requires O(1) auxiliary space (the heap can be built in-place, and it's possible to reverse an array in O(1) space).
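Here is a short Python sketch of that procedure (names are mine; an explicit sift-down is used so the heap really is built and consumed in place):

def first_k_sorted(arr, k):
    # Rearrange arr in place so the first k elements are the k smallest,
    # in increasing order: O(n + k log n) time, O(1) auxiliary space.
    n = len(arr)

    def sift_down(i, size):
        # Restore the min-heap property in arr[0:size], starting at index i.
        while True:
            child = 2 * i + 1
            if child >= size:
                return
            if child + 1 < size and arr[child + 1] < arr[child]:
                child += 1
            if arr[i] <= arr[child]:
                return
            arr[i], arr[child] = arr[child], arr[i]
            i = child

    for i in range(n // 2 - 1, -1, -1):   # build the heap bottom-up, O(n)
        sift_down(i, n)
    for size in range(n, n - k, -1):      # k dequeues, O(k log n)
        arr[0], arr[size - 1] = arr[size - 1], arr[0]
        sift_down(0, size - 1)
    arr.reverse()                         # O(n); the k smallest now lead, ascending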
You can do better, though. @timrau suggested in the comments that you use quickselect (or, more generally, any linear-time selection algorithm). These algorithms rearrange arrays to put the lowest k elements in some order in the first k slots of the array and the remaining n - k elements in the last n - k slots in some order. Doing so takes time O(n) regardless of k (nifty!). So suppose you do that, then just sort the first k elements. This takes time O(n + k log k), which is asymptotically better than the O(n + k log n)-time heap-based algorithm.
Of the known linear-time selection algorithms, both quickselect and the median-of-medians algorithms can be implemented in-place if you're careful, so the total space required for this approach is O(1).
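Under the same assumptions, a sketch of the select-then-sort variant (a randomized in-place quickselect, so the O(n) bound is expected rather than worst-case; median-of-medians would make it worst-case linear):

import random

def lowest_k_sorted(a, k):
    # Move the k smallest elements into a[:k] (expected O(n)),
    # then sort just that prefix (O(k log k)). Assumes 1 <= k <= len(a).
    def partition(lo, hi):
        # Lomuto partition around a random pivot; returns its final index.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        store = lo
        for i in range(lo, hi):
            if a[i] < a[hi]:
                a[store], a[i] = a[i], a[store]
                store += 1
        a[store], a[hi] = a[hi], a[store]
        return store

    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = partition(lo, hi)
        if mid == k - 1:
            break
        if mid < k - 1:
            lo = mid + 1
        else:
            hi = mid - 1
    a[:k] = sorted(a[:k])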
It occurs to me that you can do this in place with a slightly modified heap selection algorithm, which is O(n log k). Although asymptotically "worse" than quickselect's O(n) complexity, heap select can outperform quickselect when k is very small in comparison to n; see When theory meets practice for details. If you're selecting, say, the top 1,000 items from a list of a million, heap select will almost certainly be faster.
Anyway, to do this in place, you build a max-heap (using the standard BuildHeap function) of size k at the front of the array, from the first k items in the array. That takes O(k). Then, you process the rest of the items in the array like this:
for (i = k; i < length; ++i)
{
    if (array[i] < array[0]) // If item is smaller than largest item on heap
    {
        // put large item at the current position
        temp = array[i];
        array[i] = array[0];
        // put new item at the top of heap and sift it down
        array[0] = temp;
        SiftDown(0);
    }
}
That will take O(n log k) time, but the limiting factor is really how many times you have to perform the code inside the conditional. Only when an item is smaller than the largest item already on the heap does this step take any processing. Worst case is when the array is in reverse sorted order. Otherwise it's surprisingly fast.
When done, the smallest k items are at the front of the array.
You then have to sort them, which is O(k log k).
So the complete procedure is O(k + n log k + k log k). Again, when k is much smaller than n, this is considerably faster than Quickselect.

Find elements which need to be removed from an array such that 2*min>max

Consider the following problem:
An array of integers is given. Your goal is to trim the array such that 2*min > max, where min and max are the minimum and maximum elements of the array. You can remove elements either from the start or from the end of the array if the above condition is not met. The number of removals should be minimized.
For example if the array is
a, b, c, d, e, f
where c is the minimum and e is the maximum, then if 2*c > e is true, we are done. If not, we could remove either from the start (i.e. a,b,c) or from the end (i.e. e, f) such that new min or max would satisfy the condition and removals should be minimum.
I have an O(n^2) algorithm for this problem. Can this be solved in O(n log n) time?
Note that the problem is to find the largest subarray that fulfills the condition. Realize that if the condition holds for an interval of indices, it also holds for all the intervals included in it. So if we fix one border, we can greedily choose the other border as far away from it as possible.
It's possible to solve this in linear time:
Define r_i as the rightmost possible right boundary if you choose element i as the left boundary. We can show that r_i is monotonic in i, so we can maintain two pointers to i and r_i and increment r_i as much as possible every time after we increment i by one. Both pointers are incremented a total of O(n) times, and we can maintain the range minima/maxima in O(log n) per increment using a heap or binary search tree of the values in the range.
Using a monotonic queue we can maintain the extrema in O(1) and get a total runtime of O(n). Another C++ implementation of the queue can be found here, for example.
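A Python sketch of that two-pointer scheme with monotonic deques (the function name is mine; it returns the length of the longest valid window, so the minimum number of removals is len(a) minus the result):

from collections import deque

def longest_valid_window(a):
    # Longest contiguous window with 2*min > max, O(n) overall:
    # min_dq keeps indices of increasing values, max_dq of decreasing values.
    min_dq, max_dq = deque(), deque()
    best = left = 0
    for right, x in enumerate(a):
        while min_dq and a[min_dq[-1]] >= x:
            min_dq.pop()
        min_dq.append(right)
        while max_dq and a[max_dq[-1]] <= x:
            max_dq.pop()
        max_dq.append(right)
        # Shrink from the left until the window satisfies 2*min > max.
        while left <= right and 2 * a[min_dq[0]] <= a[max_dq[0]]:
            if min_dq[0] == left:
                min_dq.popleft()
            if max_dq[0] == left:
                max_dq.popleft()
            left += 1
        best = max(best, right - left + 1)
    return best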
Another, somewhat less elegant way would be to use an RMQ data structure. It lets you query the min/max in a range in O(1) after O(n log n) preprocessing (O(n) preprocessing is also possible, but complicated and unnecessary here, since the rest of the algorithm is not linear-time anyway).
Now fix the left border (there are n possibilities). Use binary search to find the rightmost right boundary which still fulfills the condition (you can check in O(1) whether it does).
This works because the predicate "range fulfills the condition" is monotone with respect to inclusion (if a range fulfills it, all ranges included in it also fulfill it).
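For completeness, a sketch of that RMQ variant with a sparse table (names are mine; the build is O(n log n), each query O(1), and each left border gets an O(log n) binary search):

def build(a, op):
    # Sparse table: row j answers op over blocks of length 2**j.
    n, table, j = len(a), [list(a)], 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        table.append([op(prev[i], prev[i + half]) for i in range(n - (1 << j) + 1)])
        j += 1
    return table

def query(table, op, l, r):
    # op over a[l..r] inclusive, O(1) via two overlapping blocks.
    j = (r - l + 1).bit_length() - 1
    return op(table[j][l], table[j][r - (1 << j) + 1])

def longest_valid(a):
    mins, maxs = build(a, min), build(a, max)
    ok = lambda l, r: 2 * query(mins, min, l, r) > query(maxs, max, l, r)
    best = 0
    for l in range(len(a)):
        if not ok(l, l):                # even a single element can fail (x <= 0)
            continue
        lo, hi = l, len(a) - 1          # binary-search the rightmost valid r
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if ok(l, mid):
                lo = mid
            else:
                hi = mid - 1
        best = max(best, lo - l + 1)
    return best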
Can you change the order of elements in the array?
If so, you can do the following steps:
Sort the array with, for example, quicksort or whatever best fits your requirements -> O(n lg n)
Take the first key k = 2*input[0] -> O(1)
Use the binary search algorithm to find the index of the key k in the array -> O(lg n)
If the found index is greater than half the length of the input array, remove the last part of the array; otherwise remove the first part. -> O(1)
In this case the complexity is the sum of O(n lg n) + O(1) + O(lg n) + O(1) => O(n lg n)

Find the minimum number of elements required so that their sum equals or exceeds S

I know this can be done by sorting the array and taking the larger numbers until the required condition is met. That would take at least O(n log n) sorting time.
Is there any improvement over O(n log n)?
We can assume all numbers are positive.
Here is an algorithm that is O(n + size(smallest subset) * log(n)). If the smallest subset is much smaller than the array, this will be O(n).
Read http://en.wikipedia.org/wiki/Heap_%28data_structure%29 if my description of the algorithm is unclear (it is light on details, but the details are all there).
Turn the array into a heap arranged such that the biggest element is available in time O(n).
Repeatedly extract the biggest element from the heap until their sum is large enough. This takes O(size(smallest subset) * log(n)).
This is almost certainly the answer they were hoping for, though not getting it shouldn't be a deal breaker.
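A direct Python sketch of those two steps (heapq is a min-heap, so values are negated to get a max-heap; the function name is mine):

import heapq

def min_elements(nums, S):
    # Heapify into a max-heap in O(n), then pop the largest until the sum reaches S.
    heap = [-x for x in nums]
    heapq.heapify(heap)
    total = count = 0
    while heap and total < S:
        total += -heapq.heappop(heap)     # O(log n) per pop
        count += 1
    return count if total >= S else None  # None: the whole array falls short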
Edit: Here is another variant that is often faster, but can be slower.
Walk through the elements until the sum of the first few meets or exceeds S. Store this running sum as current_sum.
Copy those elements into an array.
Heapify that array such that the minimum is easy to find; remember the minimum.
For each remaining element in the main array:
    if min(in our heap) < element:
        insert element into heap
        increase current_sum by element
        while S + min(in our heap) < current_sum:
            current_sum -= min(in our heap)
            remove min from heap
If we get to reject most of the array without manipulating our heap, this can be up to twice as fast as the previous solution. But it is also possible to be slower, such as when the last element in the array happens to be bigger than S.
Assuming the numbers are integers, you can improve upon the usual n lg(n) complexity of sorting because in this case we have the extra information that the values are between 0 and S (for our purposes, integers larger than S are the same as S).
Because the range of values is finite, you can use a non-comparative sorting algorithm such as Pigeonhole Sort or Radix Sort to go below n lg(n).
Note that these methods are dependent on some function of S, so if S gets large enough (and n stays small enough) you may be better off reverting to a comparative sort.
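A sketch of that pigeonhole variant for positive integers (O(n + S) time, O(S) space; the clamping reflects the remark above that values larger than S may be treated as S):

def min_elements_counting(nums, S):
    counts = [0] * (S + 1)
    for x in nums:
        counts[min(x, S)] += 1          # clamp: any single value >= S suffices
    total = taken = 0
    for v in range(S, 0, -1):           # consume from the largest value down
        while counts[v] and total < S:
            total += v
            taken += 1
            counts[v] -= 1
        if total >= S:
            return taken
    return None                          # the whole array falls short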
Here is an O(n) expected time solution to the problem. It's somewhat like Moron's idea but we don't throw out the work that our selection algorithm did in each step, and we start trying from an item potentially in the middle rather than using the repeated doubling approach.
Alternatively, it's really just quickselect with a little additional bookkeeping for the remaining sum.
First, it's clear that if you had the elements in sorted order, you could just pick the largest items first until you exceed the desired sum. Our solution is going to be like that, except we'll try as hard as we can not to discover ordering information, because sorting is slow.
You want to be able to determine whether a given value is the cutoff: if including that value and everything greater than it meets or exceeds S, but removing it drops us below S, then we are golden.
Here is the code in Python. I didn't test it for edge cases, but this gets the idea across.
import random

def Solve(arr, s):
    # We could get rid of worst-case O(n^2) behavior that basically never happens
    # by selecting the median here deterministically, but in practice, the constant
    # factor on the algorithm would be much worse.
    p = random.choice(arr)
    # Partition around p; p itself ends up in neither left_arr nor right_arr.
    rest = list(arr)
    rest.remove(p)
    left_arr = [x for x in rest if x <= p]
    right_arr = [x for x in rest if x > p]
    right_sum = sum(right_arr)
    if right_sum + p >= s:
        if right_sum < s:
            # solved it, p forms the cut off
            return len(right_arr) + 1
        # took too much, at least we eliminated left_arr and p
        return Solve(right_arr, s)
    else:
        # didn't take enough yet; take all of right_arr and p, recurse on the rest
        return len(right_arr) + 1 + Solve(left_arr, s - right_sum - p)
One improvement (asymptotically) over Theta(n log n) you can make is an O(n log K)-time algorithm, where K is the required minimum number of elements.
Thus if K is constant, or say log n, this is better (asymptotically) than sorting. Of course, if K is n^epsilon, then this is no better than Theta(n log n).
The way to do this is to use selection algorithms, which can tell you the ith largest element in O(n) time.
Now do an exponential (doubling) search for K: start with i = 1 (the largest alone) and double i at each turn, then binary-search within the last interval.
You find the ith largest, and find the sum of the i largest elements and check if it is greater than S or not.
This way, you would run O(log K) runs of the selection algorithm (which is O(n)) for a total running time of O(n log K).
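A sketch along those lines (randomized quickselect stands in for a true linear-time selection, so the O(n log K) bound is expected rather than worst-case; all names are mine and the numbers are assumed positive):

import random

def kth_largest(a, k):
    # Randomized quickselect over copies: expected O(len(a)). Assumes 1 <= k <= len(a).
    p = random.choice(a)
    big = [x for x in a if x > p]
    if k <= len(big):
        return kth_largest(big, k)
    small = [x for x in a if x < p]
    if k <= len(a) - len(small):         # falls on a copy of the pivot
        return p
    return kth_largest(small, k - (len(a) - len(small)))

def top_i_sum(a, i):
    # Sum of the i largest elements: one selection plus one O(n) pass.
    t = kth_largest(a, i)
    bigger = [x for x in a if x > t]
    return sum(bigger) + t * (i - len(bigger))

def min_elements_selection(a, S):
    # Double i until the top-i sum reaches S, then binary-search below it:
    # O(log K) probes of an O(n) routine, O(n log K) expected overall.
    i = 1
    while i < len(a) and top_i_sum(a, i) < S:
        i *= 2
    i = min(i, len(a))
    if top_i_sum(a, i) < S:
        return None                      # the whole array falls short
    lo, hi = i // 2 + 1, i
    while lo < hi:
        mid = (lo + hi) // 2
        if top_i_sum(a, mid) >= S:
            hi = mid
        else:
            lo = mid + 1
    return lo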
Scan the numbers once; if you find some number >= S, you are done (a single element suffices).
Pigeonhole-sort the numbers < S.
Sum elements from highest to lowest in the sorted order until you meet or exceed S.
