Find if an element in an array occurs more frequently than some value

I have the problem where given an array, A[1...n] of n (not necessarily distinct) integers, find an algorithm to determine whether any item occurs more than ceiling of n/4 times in A.
It seems that the best-possible worst case time is O(n). I am aware of the majority element algorithm and am wondering if this may be applied to this situation. Please let me know if you have any suggestions for approaching this problem.

This is only an idea of an algorithm but I believe it should be possible to make it work.
The main trick is as follows: if you look at any four elements and they are all different, you may throw all four away. (If any element had frequency above 1/4 in the old array, it still will in the new array; if none had, none will.)
So you go over an array, throwing away tuples of four, and rearranging the rest. For instance, if you have AABC and then DDEF, you rearrange to AADDBCEF and throw BCEF away. I will let you work out the details, it's not hard. In the end you should be left with pairs of identical elements. Then you throw odd-numbered elements away and repeat.
After each run you may be left with 1, 2 or 3 unpaired elements that you cannot throw away. No worry: you can combine the leftovers of two runs so that there are never more than 3 elements in the leftover pile. E.g. if after run 1 you have A, B, C and after run 2 you have A, D, E, you keep just A. Remember that elements from the second run count twice, so in effect you have three A's, which is more than 1/4 of the total of 9. Keep a count for each leftover element to track which of them can be thrown away. (You might be able to just always keep the last leftovers; I have not checked that.)
In the end you will have just the leftovers. Check each one against the original array.

There are three possibilities for such an element: it is the median of the array, the median of the n/2 smallest elements, or the median of the n/2 largest elements.
In all cases, first find the median of the array. Then check whether it occurs in at least n/4 positions; if not, divide the array into two parts of almost the same size (they differ in size by at most one element):
- the elements less than or equal to the median
- the elements greater than or equal to the median
Then find the median of each of these two subarrays and check the number of occurrences of each. All of this is O(n). In this way you can also find all elements occurring at least n/4 times.
By the way, you can extend this technique to find any element with Ω(n) occurrences (e.g. n/10000), again in O(n) time.
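The median-based check can be sketched in Python as below. Note this is only a sketch with a hypothetical function name: `sorted` is used for brevity, which makes it O(n log n); replacing it with a linear-time selection (e.g. quickselect) would give the O(n) bound.

```python
import math

def occurs_more_than_quarter(a):
    # Any value occurring more than ceil(n/4) times must occupy the
    # median rank of the array, of its lower half, or of its upper
    # half, i.e. one of the ranks n/4, n/2, 3n/4 in sorted order.
    n = len(a)
    s = sorted(a)  # stand-in for three linear-time selections
    candidates = {s[n // 4], s[n // 2], s[(3 * n) // 4]}
    threshold = math.ceil(n / 4)
    return any(a.count(c) > threshold for c in candidates)
```

The verification step (`a.count`) is a plain O(n) scan per candidate, so only the selection step prevents this from being linear overall.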

Find the repeating element in an array in O(1) space(Numbers are not in any range)

Given an array of n integers, all numbers are unique except one of them.
The repeating number repeats n/2 times if n is even
The repeating number repeats (n-1)/2 or (n+1)/2 times if n is odd
The repeating number is not adjacent to itself in the array
Write a program to find the repeating number without using extra space.
This is how I tried to solve the problem.
If n is even, the repeating number occurs n/2 times. Also, its occurrences cannot be adjacent. So if, say, there are 6 elements, the repeating element occurs 3 times. The occurrences can be either at indices 0, 2 and 4 or at 1, 3 and 5. So if I just check whether any element repeats at indices 0 and 2, and then at indices 1 and 3, I can get the repeating element.
If n is odd, then there are 2 choices.
If the repeating element occurs (n+1)/2 times, then we can just check indices 0 and 2. For example, say there are 7 elements and the repeating one occurs 4 times; then its occurrences have to be at indices 0, 2, 4 and 6.
However, I cannot find a way to handle the case where the repeating element occurs (n-1)/2 times and n is odd. I have thought of using XOR and sums but can't find a way.
Let us call the element that repeats as the "majority".
Boyer–Moore majority vote algorithm can help here. The algorithm finds an element that occurs repeatedly for more than half of the elements of the input if any such element exists.
But in your case the situation is interesting. The majority may not occur more than half the times. All elements are unique except the repeating one and repeating numbers are not adjacent. Also, majority element exists for sure.
So,
Run the majority vote algorithm on the numbers at even indices in the array. Make a second pass through the input array to verify that the element reported by the algorithm really is a majority.
If the above step doesn't yield the majority element, repeat the procedure for the numbers at odd indices. You can do this second step a bit more smartly, because we know for sure that a majority element exists: any number that repeats would be the result.
In the implementation of above, there is a good scope for small optimizations.
I don't think I need to explain the majority vote algorithm here; if you want me to, let me know. Without knowing this algorithm, we should be able to manage with some counting logic (which would most probably end up the same as the majority algorithm), but since it is a standard algorithm, we can leverage it.
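A rough sketch of the two-pass approach described above (function names are mine); the verification uses the fact that exactly one value repeats, so a candidate appearing at least twice must be the answer:

```python
def boyer_moore(seq):
    # Standard Boyer-Moore majority vote: if seq has a strict
    # majority element, this returns it (otherwise the returned
    # candidate is arbitrary and must be verified separately).
    cand, count = None, 0
    for x in seq:
        if count == 0:
            cand, count = x, 1
        elif x == cand:
            count += 1
        else:
            count -= 1
    return cand

def find_repeating(a):
    # Vote over the even-indexed elements, then (if verification
    # fails) over the odd-indexed ones.
    for start in (0, 1):
        cand = boyer_moore(a[start::2])
        if cand is not None and a.count(cand) >= 2:
            return cand
    return None
```

The `a.count` verification keeps the slicing trick honest: a wrong candidate from one parity class can never be accepted, because only the repeating number occurs more than once.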

Given a source and a final array, find the number of hops to generate final from the source in less than quadratic time complexity

Suppose you have an array of integers (e.g. [1 5 3 4 6]). The elements are rearranged according to the following rule. Every element can hop forward (towards the left) and slide the elements in the indices over which it hopped. The process starts with the element at the second index (i.e. 5). It has a choice to hop over element 1, or it can stay in its own position. If it does choose to hop, element 1 slides down to index 2. Let us assume it does choose to hop; our resultant array will then be [5 1 3 4 6]. Element 3 can now hop over 1 or 2 positions and the process repeats. If 3 hops over one position the array will be [5 3 1 4 6], and if it hops over two positions it will be [3 5 1 4 6].
It is very easy to show that every permutation of the elements can be reached this way. Also, any final configuration is reached by a unique set of hops.
The question is: given a final array and a source array, find the total number of hops required to arrive at the final array from the source. An O(N^2) implementation is easy to find, but I believe this can be done in O(N) or O(N log N). If it is not possible to do better than O(N^2), it would be great to know that too.
For example if the final array is [3,5,1,4,6] and the source array [1,5,3,4,6], the answer will be 3.
My O(N^2) algorithm is like this: loop over the positions of the source array from the end, since we know the last element is the last one to move. Here it is 6, and we check its position in the final array. We calculate the number of hops necessary, then rearrange the final array to put that element in its original position in the source array. The rearranging step goes over all the elements in the array, and the process loops over all the elements, hence O(N^2). Using a hashmap or map can help with searching, but the map needs to be updated after every step, which keeps it O(N^2).
P.S. I am trying to model correlation between two permutations in a Bayesian way and this is a sub-problem of that. Any ideas on modifying the generative process to make the problem simpler is also helpful.
Edit: I have found my answer. This is exactly what the Kendall tau distance computes. There is an easy merge-sort-based algorithm to find it in O(N log N).
Consider the target array as an ordering. A target array [2 5 4 1 3] can be seen as [1 2 3 4 5], just by relabeling. You only have to know the mapping to be able to compare elements in constant time. On this instance, to compare 4 and 5 you check: index[4]=2 > index[5]=1 (in the target array) and so 4 > 5 (meaning: 4 must be to the right of 5 at the end).
So what you really have is just a vanilla sorting problem. The ordering is just different from the usual numerical ordering. The only thing that changes is your comparison function. The rest is basically the same. Sorting can be achieved in O(nlgn), or even O(n) (radix sort). That said, you have some additional constraints: you must sort in-place, and you can only swap two adjacent elements.
A strong and simple candidate would be selection sort, which will do just that in O(n^2) time: on each iteration, identify the "leftmost" remaining element in the "unplaced" portion and swap it until it lands at the end of the "placed" portion. This can be improved to O(n lg n) with an appropriate data structure (a priority queue for identifying the "leftmost" remaining element in O(lg n) time). Since n lg n is a lower bound for comparison-based sorting, I really don't think you can do better than that.
Edit: So you're not interested in the sequence of swaps at all, only the minimum number of swaps required. This is exactly the number of inversions in the array (adapted to your particular needs: "non natural ordering" comparison function, but it doesn't change the maths). See this answer for a proof of that assertion.
One way to find the number of inversions is to adapt the Merge Sort algorithm. Since you have to actually sort the array to compute it, it turns out to be still O(nlgn) time. For an implementation, see this answer or this (again, remember that you'll have to adapt).
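The adapted merge sort can be sketched as below (the relabeling via the `pos` dictionary is the "non natural ordering" comparison mentioned above; the function name is mine):

```python
def count_inversions(source, final):
    # Number of adjacent swaps = number of inversions of `source`
    # after relabeling each element by its position in `final`.
    pos = {v: i for i, v in enumerate(final)}
    seq = [pos[v] for v in source]

    def sort_count(a):
        # Returns (sorted copy of a, number of inversions in a).
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_l = sort_count(a[:mid])
        right, inv_r = sort_count(a[mid:])
        merged, inv = [], inv_l + inv_r
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                inv += len(left) - i  # left[i:] all invert with right[j]
                merged.append(right[j]); j += 1
        merged += left[i:] + right[j:]
        return merged, inv

    return sort_count(seq)[1]
```

On the example from the question, `count_inversions([1, 5, 3, 4, 6], [3, 5, 1, 4, 6])` gives 3.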
From your answer I assume the number of hops is the total number of swaps of adjacent elements needed to transform the original array into the final array.
I suggest using something like insertion sort, but without the insertion part - the data in the arrays will not be altered.
You can implement the queue t of stalled hoppers as a balanced binary search tree with counters (the number of elements in each subtree).
You can add an element to t, remove an element from t, rebalance t, and find an element's position in t in O(log C) time, where C is the number of elements in t.
A few words on finding the position of an element: it is a binary search that accumulates the counts of the skipped left subtrees (plus 1 for each middle element, if you keep elements in internal nodes).
A few words on balancing/addition/removal: you have to traverse upward from the removed/added element/subtree and update the counters. The overall number of operations still holds at O(log C) for insert+balance and remove+balance.
Let t be that queue (the balanced search tree), p the current index into the original array a, and q the index into the final array f.
Now we have one loop starting from the left side (say, p = 0, q = 0):
If a[p] == f[q], then original array element hops over the whole queue. Add t.count to the answer, increment p, increment q.
If a[p] != f[q] and f[q] is not in t, then insert a[p] into t and increment p.
If a[p] != f[q] and f[q] is in t, then add f[q]'s position in queue to answer, remove f[q] from t and increment q.
I like the magic that ensures this process moves p and q to the ends of the arrays at the same time if the arrays really are permutations of one another. Nevertheless, you should probably check p and q for overruns to detect incorrect data, as we have no faster way to prove the data is correct.
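The three-case loop above can be sketched with a plain Python list standing in for the balanced tree (function name mine); membership and position lookups on a list are O(C), so this sketch is quadratic in the worst case, whereas the tree with subtree counters would make each queue operation O(log C):

```python
def count_hops(a, f):
    t = []          # queue of stalled elements, in original-array order
    answer = 0
    p = q = 0
    while p < len(a) or q < len(f):
        if p < len(a) and q < len(f) and a[p] == f[q]:
            answer += len(t)         # hops over the whole queue
            p += 1
            q += 1
        elif p < len(a) and (q >= len(f) or f[q] not in t):
            t.append(a[p])           # a[p] stalls at the back of the queue
            p += 1
        else:
            answer += t.index(f[q])  # hops over the stalled elements ahead of it
            t.remove(f[q])
            q += 1
    return answer
```

On the question's example, `count_hops([1, 5, 3, 4, 6], [3, 5, 1, 4, 6])` gives 3, matching the Kendall tau count.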

Find the median of an unsorted array without sorting [duplicate]

This question already has answers here:
O(n) algorithm to find the median of n² implicit numbers
(3 answers)
Closed 7 years ago.
Is there a way to find the median of an unsorted array:
1- without sorting it.
2- without using the select algorithm, nor the median of medians.
I found a lot of other questions similar to mine, but the solutions, most if not all of them, discuss the select problem and the median of medians.
You can certainly find the median of an array without sorting it. What is not easy is doing that efficiently.
For example, you could just iterate over the elements of the array; for each element, count the number of elements less than it and the number equal to it, until you find a value with the correct counts. That is O(n^2) time but only O(1) space.
Or you could use a min-heap whose size is just over half the size of the array. (That is, if the array has 2k or 2k+1 elements, the heap should have k+1 elements.) Build the heap from the first k+1 array elements using the standard heap-building algorithm (which is O(N)). Then, for each remaining element x, if x is greater than the heap's minimum, replace the minimum with x and sift it down (which is O(log N)). At the end, the median is either the heap's minimum element (if the original array's size was odd) or the average of the two smallest elements in the heap. So that's a total of O(n log n) time, and O(n) space if you cannot rearrange the array elements. (If you can rearrange them, you can do this in place.)
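The heap approach can be sketched with the standard library's `heapq` (function name mine); `heapreplace` does the pop-min-and-push in one sift:

```python
import heapq

def median_via_heap(a):
    # Keep a min-heap holding the k+1 largest elements (k = n // 2).
    # Its minimum is then the median for odd n; for even n the median
    # is the average of the heap's two smallest elements.
    n = len(a)
    k = n // 2
    heap = a[:k + 1]
    heapq.heapify(heap)                    # O(k)
    for x in a[k + 1:]:
        if x > heap[0]:
            heapq.heapreplace(heap, x)     # O(log k)
    if n % 2 == 1:
        return heap[0]
    smallest = heapq.heappop(heap)
    return (smallest + heap[0]) / 2
```

This copies the first k+1 elements rather than rearranging the input, matching the "cannot rearrange array elements" case above.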
There is a randomized algorithm that accomplishes this task in O(n) steps (average case), but it does involve sorting some subsets of the array. And, because of its random nature, there is no guarantee it will ever actually finish (though this unfortunate event happens with vanishing probability).
I will leave the main idea here. For a more detailed description and for the proof of why this algorithm works, check here.
Let A be your array and let n = |A|. Let's assume all elements of A are distinct. The algorithm goes like this:
Randomly select t = n^(3/4) elements from A.
Let T be the "set" of the selected elements. Sort T.
Set pl = T[t/2-sqrt(n)] and pr = T[t/2+sqrt(n)].
Iterate through the elements of A and determine how many elements are less than pl (denoted by l) and how many are greater than pr (denoted by r). If l > n/2 or r > n/2, go back to step 1.
Let M be the set of elements in A in between pl and pr. M can be determined in step 4, just in case we reach step 5. If the size of M is no more than 4t, sort M. Otherwise, go back to step 1.
Return m = M[n/2-l] as the median element.
The main idea behind the algorithm is to obtain two elements (pl and pr) that enclose the median element (i.e. pl < m < pr) and are very close to each other in the ordered version of the array (and to do this without actually sorting the array). With high probability, all six steps need to execute only once (i.e. you will get pl and pr with these "good" properties on the first and only pass through steps 1-5, with no going back to step 1). Once you find two such elements, you can simply sort the elements between them and find the median element of A.
Steps 2 and 5 do involve some sorting (which might be against the "rules" you've mysteriously established :p). If sorting a sub-array is on the table, you should use a sorting method that runs in O(s log s) steps, where s is the size of the array being sorted. Since T and M are significantly smaller than A, these sorting steps take less than O(n) steps. If it is also against the rules to sort a sub-array, consider that in both cases the sorting is not really needed: you only need a way to determine pl, pr and m, which is just another selection problem (with the respective indices). While sorting T and M does accomplish this, you could use any other selection method (perhaps something rici suggested earlier).
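The six steps can be sketched as below (function name mine). This sketch assumes distinct elements and an n that is odd and large enough for the index arithmetic on the sample to make sense; like the algorithm itself, it loops until a sample succeeds:

```python
import math
import random

def randomized_median(A):
    n = len(A)
    while True:
        t = int(n ** 0.75)
        s = int(math.sqrt(n))
        T = sorted(random.sample(A, t))            # steps 1-2
        pl, pr = T[t // 2 - s], T[t // 2 + s]      # step 3
        l = sum(1 for x in A if x < pl)            # step 4
        r = sum(1 for x in A if x > pr)
        if l > n // 2 or r > n // 2:
            continue                               # pl, pr missed the median
        M = sorted(x for x in A if pl <= x <= pr)  # step 5
        if len(M) > 4 * t:
            continue
        return M[n // 2 - l]                       # step 6
```

The returned value is correct whenever the checks pass: there are l elements below pl, so the rank-n//2 element of A is at offset n//2 - l inside the sorted M.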
A non-destructive routine selip() is described at http://www.aip.de/groups/soe/local/numres/bookfpdf/f8-5.pdf. It makes multiple passes through the data, at each stage making a random choice of items within the current range of values and then counting the number of items to establish the ranks of the random selection.

Finding Median in Three Sorted Arrays in O(logn)

From a few minutes of googling, I know the basic idea.
1. Let A, B, and C be sorted arrays, each containing n elements.
2. Pick the median of each array; call them medA, medB, and medC.
3. Without loss of generality, suppose that medA > medB > medC.
4. The elements bigger than medA in array A cannot become the median of the three arrays. Likewise, the elements smaller than medC in array C cannot, so such elements will be ignored.
5. Repeat steps 2-4 recursively.
My question is, what is the base case?
Assuming various base cases, I tested the algorithm by hand for hours, but I was not able to find a correct one.
Also, the lengths of the three arrays become different at every recursive step. Does step 4 work even if the lengths of the three arrays are different?
This algorithm works for two sorted arrays of the same size, but not for three. After one iteration, you eliminate half of the elements in A and C but leave B unchanged, so the numbers of elements in these arrays are no longer the same, and the method no longer applies. For arrays of different sizes, applying the same method removes different numbers of elements from the lower half and the upper half, so the median of the remaining elements is not the same as the median of the original arrays.
That being said, you can modify the algorithm to eliminate the same number of elements at both ends in each iteration, though this can be inefficient when some of the arrays are very small and some are very large. You can also turn this into the problem of finding the k-th element: track the number of elements being thrown away and update the value of k at each iteration. Either way, this is much trickier than the two-array situation.
There is another post talking about a general case: Median of 5 sorted arrays
I think you can use the selection algorithm, slightly modified to handle more arrays.
You're looking for the median of the combined arrays, which is the p = [n/2]-th smallest element.
Pick the median of the largest array and find, for that value, the splitting point in the other two arrays (binary search, O(log n)). Now you know that the selected number is the k-th smallest (k = sum of the positions).
If k > p, discard the elements above it in all 3 arrays; if k < p, discard the elements below it (discarding can be implemented by maintaining lower and upper indexes for each array separately). If k < p, also update p = p - k.
Repeat until k=p.
Oops, I think this is log(n)^2, let me think about it...
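Whichever variant is attempted, a simple O(n) baseline is handy for testing it; the standard library's `heapq.merge` does a three-way merge of the sorted inputs directly (function name mine, and this is deliberately not the O(log n) approach discussed above):

```python
import heapq

def median_of_three_sorted(A, B, C):
    # O(n) reference implementation: merge the three sorted arrays
    # and read the median off the merged sequence.
    merged = list(heapq.merge(A, B, C))
    m = len(merged)
    if m % 2 == 1:
        return merged[m // 2]
    return (merged[m // 2 - 1] + merged[m // 2]) / 2
```

Comparing a divide-and-conquer implementation against this on random inputs is an easy way to flush out wrong base cases.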

Majority Voting Algorithm - WRONG?

A majority voting algorithm decides which element of a sequence is in the majority, provided there is such an element. Here is the most frequently-cited link that I found when I was trying to understand it.
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html
Besides, we have here a link that discusses the problem:
How to find the element of an array that is repeated at least N/2 times?
The problem is that the answer marked as correct is WRONG. Note that the question actually allows the input to have exactly N/2 copies of a single element (not necessarily more than N/2, as usually assumed in majority-element detection algorithms).
I copied the code and tried it with inputs like [1, 2, 3, 2] and [1, 2, 3, 2, 6, 2] (producing results of 3 and 6). The same applies to the algorithm cited above (which returns "No Majority Element!"). The problem is this: whenever there's an alternation between the majority element and anything else, the last element in the array that's not the majority element is chosen. Please correct my thinking if it is wrong, and tell me how to avoid this in the implementation.
The algorithm is correct: there is no majority element in your examples. An element is in the majority only if it is more than 50% of the values.
If you wish to detect the case where the most frequent element has a count of N/2, then I don't see any way to do it in one pass and O(1) space. My best attempt is:
Run the same algorithm as before, but remember the previous candidate as well.
If you switched at the last element, then the correct answer is either your current or your previous candidate.
Run another pass, counting the number of each, and compare them.
OK, so I think I now understand what #sverre is getting at. Here's a proof that it works:
If exactly N/2 elements are the same value (call this value m), N must be even.
Split the array into two parts: the first N-1 elements, and the last element. Given that a total of N/2 elements are equal to m, then either:
the last element is not m, in which case N/2 of the first N-1 elements are equal to m, and therefore the first N-1 elements have a strict majority m; or
the last element is m, in which case (N/2)-1 of the first N-1 elements are equal to m, and therefore the first N-1 elements do not have a strict majority.
In case 1), m is the candidate just before processing the last element (because, at that point, we've just processed N-1 elements, and we know that a strict majority does exist in this case, so that candidate must be the correct answer).
In case 2), m is the last element itself. (This is the part which was confusing me: in the usual implementation of the algorithm, this would not necessarily become the candidate as it was processed.)
So:
For a strict majority (> N/2 elements the same), the answer (if it exists) is the final candidate.
For a non-strict majority (>= N/2 elements the same), the answer (if it exists) is one of:
the final candidate; or
the candidate just before processing the last element; or
the last element.
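The non-strict case analysis above can be sketched as follows (function name mine): one Boyer-Moore pass that also remembers the candidate held just before the last element, then a verification pass over the at most three possible answers.

```python
def find_half_majority(a):
    # A value occurring >= N/2 times must be the final candidate,
    # the candidate just before the last element, or the last
    # element itself, per the case analysis above.
    cand = prev_cand = None
    count = 0
    for i, x in enumerate(a):
        if i == len(a) - 1:
            prev_cand = cand  # candidate before processing the last element
        if count == 0:
            cand, count = x, 1
        elif x == cand:
            count += 1
        else:
            count -= 1
    # verification pass over the candidates
    for c in (cand, prev_cand, a[-1] if a else None):
        if c is not None and 2 * a.count(c) >= len(a):
            return c
    return None
```

On the inputs from the question, this returns 2 for both [1, 2, 3, 2] and [1, 2, 3, 2, 6, 2], since 2 occurs exactly N/2 times in each.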
