Time and Space Complexity of top k frequent elements in an array

There is a small confusion regarding the time and space complexity for the given problem:
Given an array of size N, return a list of the top K frequent elements.
Based on the most popular solution:
Use a HashMap of size K with the count of each entry as value.
Build a MaxHeap of size K by traversing the HashMap generated above.
Pop the elements in the MaxHeap into a list and return the list.
K being the number of unique elements in the input.
The space and time complexity are O(K) and O(K*log(K)) respectively.
Now the confusion starts here. We are dealing with worst-case complexity in the above analysis, and the worst value K can take is N, when all the elements in the array are unique.
Hence K <= N. Can O(K) therefore be written as O(N)?
Shouldn't the space and time complexity then be O(N) and O(N*log(N)) for the above problem?
I know this is a technicality, but it's been bothering me for a while. Please advise.

Yes, you are right: since K <= N, the time for the hashmap part should be stated as O(N).
The heap, however, only ever holds K elements, so the heap phase costs O(K*log(K)). In the worst case K = N, that term becomes O(N*log(N)) and dominates the linear hashmap step, which is why the final time complexity is usually quoted as O(K*log(K)).
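For concreteness, here is a minimal Java sketch of the HashMap-plus-heap approach being discussed, with k taken as the number of elements requested; the class and method names, and the choice to put the map entries directly on the heap, are my own and not part of the original question or answer.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKFrequent {

    // Sketch only: count frequencies, heap the unique elements by count, pop k times.
    static List<Integer> topKFrequent(int[] nums, int k) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : nums) {
            counts.merge(x, 1, Integer::sum);             // O(N) counting
        }
        // Max-heap over the unique elements, ordered by their counts.
        PriorityQueue<Map.Entry<Integer, Integer>> maxHeap =
                new PriorityQueue<>((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        maxHeap.addAll(counts.entrySet());                // one add per unique element
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < k && !maxHeap.isEmpty(); i++) {
            result.add(maxHeap.poll().getKey());          // k pops from the heap
        }
        return result;
    }

    public static void main(String[] args) {
        int[] nums = {1, 1, 1, 2, 2, 3};
        System.out.println(topKFrequent(nums, 2));        // [1, 2]
    }
}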

What is the lowest bound for the algorithm?

Consider an algorithm that gets an unsorted array of size n and a number k <= n. The algorithm prints the k smallest numbers (the 1st through k-th smallest) in ascending order. What is the lower bound for the algorithm (for every k)?
1. Omega(n)
2. Omega(k*logn)
3. Omega(n*logk)
4. Omega(n*logn)
5. #1 and #2 are both correct.
Now, from my understanding, if we want to find a lower bound for an algorithm we need to look at the worst case. If that is the case, then obviously the worst case is when k=n. We know that sorting an array is bounded by Omega(n*logn), so the right answer should be #4.
Unfortunately, I am wrong and the right answer is #5.
Why?
It can be done in O(n + k*logk).
Run a selection algorithm to find the k-th smallest element - O(n)
Iterate and return the elements lower than or equal to it - O(n)
Another iteration might be needed in case the array allows duplicates, but it is still done in O(n)
Lastly, you need to sort these k elements in O(k*logk)
It is easy to see this solution is optimal: you cannot beat the O(k*logk) factor, because otherwise by setting k=n you could sort any array faster than Omega(n*logn), and at least a linear scan is needed to find the elements to be printed.
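For illustration, here is a hedged Java sketch of this approach under the stated assumptions (0 <= k <= a.length, expected-linear randomized quickselect); it returns rather than prints the k smallest in ascending order, and the class, method, and helper names are my own.

import java.util.Arrays;
import java.util.Random;

public class SmallestK {
    private static final Random RNG = new Random();

    // Returns the k smallest values of a in ascending order: expected O(n) + O(k log k).
    static int[] kSmallestSorted(int[] a, int k) {
        int[] copy = Arrays.copyOf(a, a.length);
        if (k > 0 && k < copy.length) {
            quickselect(copy, 0, copy.length - 1, k - 1);   // expected O(n)
        }
        int[] result = Arrays.copyOf(copy, k);
        Arrays.sort(result);                                // O(k log k)
        return result;
    }

    // Rearranges a[lo..hi] so that a[target] holds its sorted value and
    // everything to its left is <= it.
    static void quickselect(int[] a, int lo, int hi, int target) {
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == target) return;
            if (p < target) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition around a randomly chosen pivot; returns the pivot's final index.
    static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RNG.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] <= pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        int[] a = {7, 2, 9, 4, 1, 8, 3};
        System.out.println(Arrays.toString(kSmallestSorted(a, 3)));  // [1, 2, 3]
    }
}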
Let's try for linear time:
To find the k-th smallest element, use "Randomized-Select", which has an average running time of O(n); call that element's value K and use it as the pivot.
Use the quicksort partition step to split the array into array[i] <= K and array[i] > K. This takes O(n) time.
Take the unsorted left part array[i] <= K (which has k elements) and counting-sort it, which takes O(k+K).
Finally, the print operation takes O(k).
Total time = O(n) + O(k+K) + O(k) = O(n+k+K)
Here, k is the number of elements which are smaller than or equal to K, and K is the value of the k-th smallest element (so the counting sort's range depends on it).

Sort an n-element array so first k-elements are lowest in increasing order (In place algorithm)

This is a homework question that I am stuck on.
I need to sort an n-element array so that the first k-elements are the lowest and are in increasing order. For k <= n/log(n), the algorithm should be O(n).
My solutions:
A simple solution that I thought of is to heapify the array (O(n)). Then delete the minimum k times, each time shifting the starting index of the heap/array forward by one (from 0 to 1, then 2, and so on, up to k). This would be O(n+k*lg(n)+k*n) = O(kn+k*lg(n)). For the given condition on k, it would be O(n^2/log(n) + n).
Another possible implementation would be to use radix sort, which would be O(n) but I have a feeling this is not the right solution because I would be sorting the entire array and they only asked to sort k elements.
You don't have to give me the answer, just a hint would be helpful.
I like your heap idea. I actually think that it would work in the time bounds you listed and that there's a bit of a glitch in your analysis.
Suppose you do the following: build an in-place min-heap in your array, then dequeue the minimum k elements, leaving the remaining n - k elements where they are in the array. If you think about where the elements will end up, you should have the k smallest elements stored at the back of the array in descending order (the smallest of all at the very end), and the n - k remaining elements will be at the front, in heap order. If you're having trouble seeing this, think about how heapsort works: after k dequeues from its max-heap, the largest k elements sit at the back in ascending order, and the remaining elements are heap-ordered at the front. Here we've swapped heapsort's max-heap for a min-heap, hence the mirrored ordering. As a result, if you then reverse the array at the end, you should have the k smallest elements in ascending order at the front, followed by the n - k remaining elements.
This will correctly find the k smallest elements, and the runtime is determined as follows:
Cost of the heapify: O(n)
Cost of k dequeues: O(k log n)
Cost of reversing the array: O(n)
Total cost: O(n + k log n)
Now, suppose that k ≤ n / log n. Then the runtime is
O(n + k log n) = O(n + (n / log n) log n) = O(n)
So you're done! The algorithm works just fine. Plus, it requires O(1) auxiliary space (the heap can be built in-place, and it's possible to reverse an array in O(1) space).
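Here is a hedged, self-contained Java sketch of this heap-based approach, assuming 0 < k <= a.length; the class name and helpers are my own.

import java.util.Arrays;

public class PartialHeapSort {

    // Leaves the k smallest elements at the front of a, in ascending order.
    static void smallestKFirst(int[] a, int k) {
        int n = a.length;
        // Build an in-place min-heap over a[0..n): O(n).
        for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, i, n);
        // Dequeue the minimum k times; each dequeue swaps the current minimum
        // to the back of the shrinking heap: O(k log n).
        int heapSize = n;
        for (int j = 0; j < k && heapSize > 1; j++) {
            swap(a, 0, heapSize - 1);
            heapSize--;
            siftDown(a, 0, heapSize);
        }
        // The k smallest now sit at the back in descending order; reversing the
        // whole array puts them at the front in ascending order: O(n).
        for (int i = 0, j = n - 1; i < j; i++, j--) swap(a, i, j);
    }

    static void siftDown(int[] a, int i, int heapSize) {
        while (true) {
            int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heapSize && a[l] < a[smallest]) smallest = l;
            if (r < heapSize && a[r] < a[smallest]) smallest = r;
            if (smallest == i) return;
            swap(a, i, smallest);
            i = smallest;
        }
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        int[] a = {9, 4, 7, 1, 8, 2, 6};
        smallestKFirst(a, 3);
        System.out.println(Arrays.toString(a));  // begins with 1, 2, 4
    }
}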
You can do better, though. @timrau suggested in the comments that you use quickselect (or, more generally, any linear-time selection algorithm). These algorithms rearrange arrays to put the lowest k elements in some order in the first k slots of the array and the remaining n - k elements in the last n - k slots in some order. Doing so takes time O(n) regardless of k (nifty!). So suppose you do that, then just sort the first k elements. This takes time O(n + k log k), which is asymptotically better than the O(n + k log n)-time heap-based algorithm.
Of the known linear-time selection algorithms, both quickselect and the median-of-medians algorithms can be implemented in-place if you're careful, so the total space required for this approach is O(1).
It occurs to me that you can do this in-place with a slightly modified heap selection algorithm, which is O(n log k). Although asymptotically "worse" than Quickselect's O(n) complexity, heap select can outperform Quickselect when k is very small in comparison to n. See When theory meets practice for details. But if you're selecting, say, the top 1,000 items from a list of a million, heap select will almost certainly be faster.
Anyway, to do this in place, you build a max-heap (using the standard BuildHeap function) of size k at the front of the array, from the first k items in the array. That takes O(k). Then, you process the rest of the items in the array like this:
for (i = k; i < length; ++i)
{
    if (array[i] < array[0]) // If item is smaller than the largest item on the heap
    {
        // put the large item at the current position
        temp = array[i];
        array[i] = array[0];
        // put the new item at the top of the heap and sift it down
        array[0] = temp;
        SiftDown(0);
    }
}
That will take O(n log k) time, but the limiting factor is really how many times you have to perform the code inside the conditional. Only when an item is smaller than the largest item already on the heap does this step take any processing. Worst case is when the array is in reverse sorted order. Otherwise it's surprisingly fast.
When done, the smallest k items are at the front of the array.
You then have to sort them, which is O(k log k).
So the complete procedure is O(k + n log k + k log k). Again, when k is much smaller than n, this is considerably faster than Quickselect.
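Here is a hedged Java rendering of the procedure just described, with BuildHeap and SiftDown written out explicitly; the class and method names are mine, and it assumes 0 < k <= array.length.

import java.util.Arrays;

public class HeapSelect {

    // After this call, the k smallest elements occupy array[0..k) in ascending order.
    static void selectAndSortSmallestK(int[] array, int k) {
        // Build a max-heap over the first k items (BuildHeap): O(k).
        for (int i = k / 2 - 1; i >= 0; i--) {
            siftDown(array, i, k);
        }
        // Scan the rest; only items smaller than the current heap maximum do real work.
        for (int i = k; i < array.length; i++) {
            if (array[i] < array[0]) {
                int temp = array[i];
                array[i] = array[0];    // put the large item at the current position
                array[0] = temp;        // put the new item at the top of the heap
                siftDown(array, 0, k);  // and sift it down: O(log k)
            }
        }
        // The smallest k items are now in array[0..k); sort them: O(k log k).
        Arrays.sort(array, 0, k);
    }

    static void siftDown(int[] a, int i, int heapSize) {
        while (true) {
            int largest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heapSize && a[l] > a[largest]) largest = l;
            if (r < heapSize && a[r] > a[largest]) largest = r;
            if (largest == i) return;
            int t = a[i]; a[i] = a[largest]; a[largest] = t;
            i = largest;
        }
    }

    public static void main(String[] args) {
        int[] a = {9, 4, 7, 1, 8, 2, 6, 5};
        selectAndSortSmallestK(a, 3);
        System.out.println(Arrays.toString(a));  // begins with 1, 2, 4
    }
}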

Inserting unknown number of elements into dynamic array in linear time

(This question is inspired by deque::insert() at index?, I was surprised that it wasn't covered in my algorithm lecture and that I also didn't find it mentioned in another question here and even not in Wikipedia :). I think it might be of general interest and I will answer it myself ...)
Dynamic arrays are data structures that allow the addition of elements at the end in amortized constant time O(1) (by doubling the size of the allocated memory each time the array needs to grow; see Amortized time of dynamic array for a short analysis).
However, insertion of a single element in the middle of the array takes linear time O(n), since in the worst case (i.e. insertion at the first position) all other elements need to be shifted by one.
If I want to insert k elements at a specific index in the array, the naive approach of performing the insert operation k times would thus lead to a complexity of O(n*k) and, if k=O(n), to a quadratic complexity of O(n²).
If I know k in advance, the solution is quite easy: expand the array if necessary (possibly reallocating space), shift the elements starting at the insertion point by k positions, and simply copy in the new elements.
But there might be situations, where I do not know the number of elements I want to insert in advance: For example I might get the elements from a stream-like interface, so I only get a flag when the last element is read.
Is there a way to insert multiple (k) elements, where k is not known in advance, into a dynamic array at consecutive positions in linear time?
In fact there is a way and it is quite simple:
First, append all k elements at the end of the array. Since appending one element takes amortized O(1) time, this is done in O(k) time.
Second, rotate the elements into place. Suppose you want to insert the elements at position pos. You then need to rotate the subarray A[pos..n+k-1] (the old elements from pos onward together with the k appended elements) by k positions to the right (or, equivalently, by n-pos positions to the left). Rotation can be done in linear time by use of a reverse operation, as explained in Algorithm to rotate an array in linear time. Thus the time needed for the rotation is O(n+k).
Therefore the total time for the algorithm is O(k)+O(n+k)=O(n+k). If the number of elements to be inserted is in the order of n (k=O(n)), you get O(n+n)=O(2n)=O(n) and thus linear time.
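As a hedged illustration, here is a Java sketch of the append-then-rotate idea, using an ArrayList as the dynamic array and the classic three-reversal rotation; the class and method names are my own.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BlockInsert {

    // Inserts all streamed items at position pos in O(n + k) total time.
    static <T> void insertAll(List<T> arr, int pos, Iterable<T> stream) {
        int n = arr.size();
        for (T item : stream) {          // append: amortized O(1) each, O(k) total
            arr.add(item);
        }
        int k = arr.size() - n;
        // Rotate arr[pos..n+k-1] right by k via three reversals: O(n+k) overall.
        reverse(arr, pos, n - 1);        // reverse the old tail
        reverse(arr, n, n + k - 1);      // reverse the appended block
        reverse(arr, pos, n + k - 1);    // reverse the whole subrange
    }

    static <T> void reverse(List<T> arr, int from, int to) {
        while (from < to) {
            Collections.swap(arr, from++, to--);
        }
    }

    public static void main(String[] args) {
        List<Integer> a = new ArrayList<>(List.of(1, 2, 3, 4));
        insertAll(a, 1, List.of(10, 11));
        System.out.println(a);           // [1, 10, 11, 2, 3, 4]
    }
}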
You could simply allocate a new array of length k+n and insert the desired elements linearly.
// Inserts the k elements of toInsert immediately after index insertionIndex.
// (Pseudocode: "new T[k + n]" stands in for allocating an array of the element type.)
newArr = new T[k + n];
for (int i = 0; i < k + n; i++) {
    newArr[i] = i <= insertionIndex     ? oldArr[i]                          // old prefix
              : i <= insertionIndex + k ? toInsert[i - insertionIndex - 1]   // inserted block
              :                           oldArr[i - k];                     // old suffix
}
return newArr;
Each iteration takes constant time, and it runs k+n times, thus O(k+n) (or, O(n) if you so like).

Find the minimum number of elements required so that their sum equals or exceeds S

I know this can be done by sorting the array and taking the largest numbers until the required condition is met. That would take at least O(n*log(n)) sorting time.
Is there any improvement over O(n*log(n))?
We can assume all numbers are positive.
Here is an algorithm that is O(n + size(smallest subset) * log(n)). If the smallest subset is much smaller than the array, this will be O(n).
Read http://en.wikipedia.org/wiki/Heap_%28data_structure%29 if my description of the algorithm is unclear (it is light on details, but the details are all there).
Turn the array into a heap arranged such that the biggest element is available in time O(n).
Repeatedly extract the biggest element from the heap until their sum is large enough. This takes O(size(smallest subset) * log(n)).
This is almost certainly the answer they were hoping for, though not getting it shouldn't be a deal breaker.
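A hedged Java sketch of this answer follows (the class and method names are mine); values are negated so that the natural-order PriorityQueue behaves as a max-heap when bulk-constructed from the array.

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class MinElementsForSum {

    // Sketch only: heapify, then extract the largest elements until the sum reaches s.
    static int minElementsForSum(int[] nums, long s) {
        List<Long> negated = new ArrayList<>(nums.length);
        for (int x : nums) {
            negated.add(-(long) x);
        }
        PriorityQueue<Long> maxHeap = new PriorityQueue<>(negated); // bulk heap construction
        long sum = 0;
        int count = 0;
        while (sum < s && !maxHeap.isEmpty()) {
            sum += -maxHeap.poll();    // repeatedly extract the current largest element
            count++;
        }
        return sum >= s ? count : -1;  // -1 if even the whole array falls short of s
    }

    public static void main(String[] args) {
        int[] nums = {4, 1, 7, 3, 2};
        System.out.println(minElementsForSum(nums, 10));   // 2  (7 + 4)
    }
}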
Edit: Here is another variant that is often faster, but can be slower.
Walk through elements, until the sum of the first few exceeds S. Store current_sum.
Copy those elements into an array.
Heapify that array such that the minimum is easy to find, remember the minimum.
For each remaining element in the main array:
    if min(in our heap) < element:
        insert element into heap
        increase current_sum by element
        while S + min(in our heap) < current_sum:
            current_sum -= min(in our heap)
            remove min from heap
If we get to reject most of the array without manipulating our heap, this can be up to twice as fast as the previous solution. But it is also possible to be slower, such as when the last element in the array happens to be bigger than S.
Assuming the numbers are integers, you can improve upon the usual n lg(n) complexity of sorting because in this case we have the extra information that the values are between 0 and S (for our purposes, integers larger than S are the same as S).
Because the range of values is finite, you can use a non-comparative sorting algorithm such as Pigeonhole Sort or Radix Sort to go below n lg(n).
Note that these methods are dependent on some function of S, so if S gets large enough (and n stays small enough) you may be better off reverting to a comparative sort.
Here is an O(n) expected time solution to the problem. It's somewhat like Moron's idea but we don't throw out the work that our selection algorithm did in each step, and we start trying from an item potentially in the middle rather than using the repeated doubling approach.
Alternatively, it's really just quickselect with a little additional bookkeeping for the remaining sum.
First, it's clear that if you had the elements in sorted order, you could just pick the largest items first until you exceed the desired sum. Our solution is going to be like that, except we'll try as hard as we can not to discover ordering information, because sorting is slow.
You want to be able to determine whether a given value is the cut-off: if we include that value and everything greater than it, we meet or exceed S, but when we remove it we fall below S. If so, we are golden.
Here is the pseudocode; I didn't test it for edge cases, but it gets the idea across.
import random

def Solve(arr, s):
    # Returns the minimum number of elements of arr whose sum is >= s.
    # Assumes s > 0, all elements are positive, and sum(arr) >= s.
    # We could get rid of the worst-case O(n^2) behavior that basically never
    # happens by selecting the median here deterministically, but in practice
    # the constant factor on the algorithm would be much worse.
    p = random.choice(arr)
    rest = list(arr)
    rest.remove(p)                              # p is in neither partition
    left_arr = [x for x in rest if x <= p]
    right_arr = [x for x in rest if x > p]
    right_sum = sum(right_arr)
    if right_sum + p >= s:
        if right_sum < s:
            # solved it, p forms the cut-off
            return len(right_arr) + 1
        # took too much; at least we eliminated left_arr and p
        return Solve(right_arr, s)
    else:
        # didn't take enough yet: include all of right_arr and p,
        # then recurse on left_arr for the remainder
        return len(right_arr) + 1 + Solve(left_arr, s - right_sum - p)
One asymptotic improvement over Theta(n*logn) is an O(n*logK) time algorithm, where K is the required minimum number of elements.
Thus if K is constant, or say log n, this is better (asymptotically) than sorting. Of course, if K is n^epsilon, then this is no better than Theta(n*logn).
The way to do this is to use selection algorithms, which can tell you the i-th largest element in O(n) time.
Now search for K by starting with i=1 (the largest) and doubling i at each turn, binary searching once you overshoot.
You find the ith largest, and find the sum of the i largest elements and check if it is greater than S or not.
This way, you would run O(log K) runs of the selection algorithm (which is O(n)) for a total running time of O(n log K).
If any number is >= S, then a single element suffices and you are done.
Otherwise, pigeonhole-sort the numbers (all of them are < S).
Sum elements from highest to lowest in the sorted order until you meet or exceed S.

Sorting an almost sorted array (elements misplaced by no more than k)

I was asked this interview question recently:
You're given an array that is almost sorted, in that each of the N elements may be misplaced by no more than k positions from the correct sorted order. Find a space-and-time efficient algorithm to sort the array.
I have an O(N log k) solution as follows.
Let's denote arr[0..N) to mean the elements of the array from index 0 (inclusive) to N (exclusive).
Sort arr[0..2k)
Now we know that arr[0..k) are in their final sorted positions...
...but arr[k..2k) may still be misplaced by k!
Sort arr[k..3k)
Now we know that arr[k..2k) are in their final sorted positions...
...but arr[2k..3k) may still be misplaced by k
Sort arr[2k..4k)
....
Until you sort arr[ik..N), then you're done!
This final step may be cheaper than the other steps when you have less than 2k elements left
In each step, you sort at most 2k elements in O(k log k), putting at least k elements in their final sorted positions at the end of each step. There are O(N/k) steps, so the overall complexity is O(N log k).
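For illustration, here is a brief Java sketch of this windowed-sort idea (the class and method names are mine); it assumes k > 0 and relies on the guarantee that every element is within k positions of its sorted place.

import java.util.Arrays;

public class AlmostSortedSort {

    // Sort overlapping windows of (at most) 2k elements, advancing by k each time.
    static void sortAlmostSorted(int[] arr, int k) {
        int n = arr.length;
        for (int start = 0; start < n; start += k) {
            // After sorting arr[start..min(start+2k, n)), the slice arr[start..start+k)
            // is in its final sorted position.
            Arrays.sort(arr, start, Math.min(start + 2 * k, n));
        }
    }

    public static void main(String[] args) {
        int[] arr = {2, 1, 4, 3, 6, 5};   // each element at most k=1 position off
        sortAlmostSorted(arr, 1);
        System.out.println(Arrays.toString(arr));  // [1, 2, 3, 4, 5, 6]
    }
}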
My questions are:
Is O(N log k) optimal? Can this be improved upon?
Can you do this without (partially) re-sorting the same elements?
As Bob Sedgewick showed in his dissertation work (and follow-ons), insertion sort absolutely crushes the "almost-sorted array". In this case your asymptotics look good but if k < 12 I bet insertion sort wins every time. I don't know that there's a good explanation for why insertion sort does so well, but the place to look would be in one of Sedgewick's textbooks entitled Algorithms (he has done many editions for different languages).
I have no idea whether O(N log k) is optimal, but more to the point, I don't really care—if k is small, it's the constant factors that matter, and if k is large, you may as well just sort the array.
Insertion sort will nail this problem without re-sorting the same elements.
Big-O notation is all very well for algorithm class, but in the real world, constants matter. It's all too easy to lose sight of this. (And I say this as a professor who has taught Big-O notation!)
If using only the comparison model, O(n log k) is optimal. Consider the case when k = n.
To answer your other question, yes it is possible to do this without sorting, by using heaps.
Use a min-heap of 2k elements. Insert 2k elements first, then remove min, insert next element etc.
This guarantees O(n log k) time and O(k) space and heaps usually have small enough hidden constants.
Since k is apparently supposed to be pretty small, an insertion sort is probably the most obvious and generally accepted algorithm.
In an insertion sort on random elements, you have to scan through N elements, and you have to move each one an average of N/2 positions, giving ~N*N/2 total operations. The "/2" constant is ignored in a big-O (or similar) characterization, giving O(N²) complexity.
In the case you're proposing, the expected number of operations is ~N*k/2 -- but since k is a constant, the whole k/2 term is ignored in a big-O characterization, so the overall complexity is O(N).
Your solution is a good one if k is large enough. There is no better solution in terms of time complexity; each element might be out of place by k positions, which means you need to learn log2(k) bits of information to place it correctly, which means you need at least log2(k) comparisons per element -- so the complexity has to be at least O(N log k).
However, as others have pointed out, if k is small, the constant terms are going to kill you. Use something that's very fast per operation, like insertion sort, in that case.
If you really wanted to be optimal, you'd implement both methods, and switch from one to the other depending on k.
It was already pointed out that one of the asymptotically optimal solutions uses a min heap and I just wanted to provide code in Java:
import java.util.PriorityQueue;

public void sortNearlySorted(int[] nums, int k) {
    // Min-heap over a sliding window of k+1 elements; the smallest element of the
    // window is always the one that belongs at the current position.
    PriorityQueue<Integer> minHeap = new PriorityQueue<>();
    for (int i = 0; i < Math.min(k, nums.length); i++) {
        minHeap.add(nums[i]);
    }
    for (int i = 0; i < nums.length; i++) {
        if (i + k < nums.length) {
            minHeap.add(nums[i + k]);
        }
        nums[i] = minHeap.remove();
    }
}
