Majority Voting Algorithm - WRONG? - c

A majority voting algorithm decides which element of a sequence is in the majority, provided there is such an element. Here is the most frequently-cited link that I found when I was trying to understand it.
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html
Besides, we have here a link that discusses the problem:
How to find the element of an array that is repeated at least N/2 times?
The problem is that the answer marked as correct is WRONG. Note that the question actually allows the input to have exactly N / 2 copies of a single element (not necessarily more than N / 2 as usually assumed in majority element detection algorithms).
I copied the code and tried it with inputs like [1, 2, 3, 2] and [1, 2, 3, 2, 6, 2] (producing results of 3 and 6). This actually applies as well to the algorithm cited above (which returns "No Majority Element!"). The problem is this: Whenever there's an alternation between the majority element and anything else, the last element in the array that's not the majority element is chosen. Please correct my wrong thoughts if any, and tell me about how to avoid it in the implementation.

The algorithm is correct: there is no majority element in your examples. An element is in the majority only if it is more than 50% of the values.
If you wish to detect the case where the most frequent element has a count of N/2, then I don't see any way to do it in one pass and O(1) space. My best attempt is:
Run the same algorithm as before, but remember the previous candidate as well.
If you switched at the last element, then the correct answer is either your current or your previous candidate.
Run another pass, counting the number of each, and compare them.

OK, so I think I now understand what #sverre is getting at. Here's a proof that it works:
If exactly N/2 elements are the same value (call this value m), N must be even.
Split these elements into two parts: the first N-1 elements, and the last element. Given that a total of N/2 elements are equal to m, then either:
the last element is not m, in which case N/2 of the first N-1 elements are equal to m, and therefore the first N-1 elements have a strict majority m; or
the last element is m, in which case (N/2)-1 of the first N-1 elements are equal to m, and therefore the first N-1 elements do not have a strict majority.
In case 1), m is the candidate just before processing the last element (because, at that point, we've just processed N-1 elements, and we know that a strict majority does exist in this case, so that candidate must be the correct answer).
In case 2), m is the last element itself. (This is the part which was confusing me: in the usual implementation of the algorithm, this would not necessarily become the candidate as it was processed.)
So:
For a strict majority (> N/2 elements the same), the answer (if it exists) is the final candidate.
For a non-strict majority (>= N/2 elements the same), the answer (if it exists) is one of:
the final candidate; or
the candidate just before processing the last element; or
the last element.

Related

Find the repeating element in an array in O(1) space(Numbers are not in any range)

Given an array of n integers, all numbers are unique exception one of them.
The repeating number repeats n/2 times if n is even
The repeating number repeats (n-1)/2 or (n+1)/2 times if n is odd
The repeating number is not adjacent to itself in the array
Write a program to find the repeating number without using extra space.
This is how I tried to solve the problem.
If n is even, then there are n/2 repeating elements. Also, the repeating elements should not be adjacent. So if say there are 6 elements, 3 elements are repeated. The elements can be either at indices 0,2 and 4 or 1,3 and 5. So if I just check if any element repeats at index 0 and 2, and then at index 1 and 3, I can get the repeating element.
If n is odd, then there are 2 choices.
If (n+1)/2 elements are repeating, then we can just check indices 0 and 2. For example say there are 7 elements, 4 of them are repeated, then repeating elements have to be at indices 0,2,4 and 6.
However I cannot find a way to find the (n-1)/2 repeating elements when n is odd. I have thought of using xor and sums but can't find a way.
Let us call the element that repeats as the "majority".
Boyer–Moore majority vote algorithm can help here. The algorithm finds an element that occurs repeatedly for more than half of the elements of the input if any such element exists.
But in your case the situation is interesting. The majority may not occur more than half the times. All elements are unique except the repeating one and repeating numbers are not adjacent. Also, majority element exists for sure.
So,
Run majority vote algorithm on numbers at even index in the array. Makes a second pass through the input array to verify that the element reported by the algorithm really is a majority.
If in the above step we don't get the majority element, you can repeat the above procedure for numbers at odd index in the array. You can do this second step a bit more smartly because we know for sure that majority element exists. So, any number that repeats would be the result.
In the implementation of above, there is a good scope for small optimizations.
I think I should not explain the majority vote algorithm here. If you want me to, let me know. Apparently, without knowing this majority algorithm we should be able to do it with some counting logic (which would most probably end up the same as the majority algorithm). But just that it's a standard algorithm, we can leverage it.

Given a source and a final array, find the number of hops to generate final from the source in less than quadratic time complexity

Suppose you have an array of integers (for eg. [1 5 3 4 6]). The elements are rearranged according to the following rule. Every element can hop forward (towards left) and slide the elements in those indices over which it hopped. The process starts with element in second index (i.e. 5). It has a choice to hop over element 1 or it can stay in its own position.If it does choose to hop, element 1 slides down to index 2. Let us assume it does choose to hop and our resultant array will then be [5 1 3 4 6]. Element 3 can now hop over 1 or 2 positions and the process repeats. If 3 hops over one position the array will now be [5 3 1 4 6] and if it hops over two positions it will now be [3 5 1 4 6].
It is very easy to show that all possible permutation of the elements is possible in this way. Also any final configuration can be reached by an unique set of occurrences.
The question is, given a final array and a source array, find the total number of hops required to arrive at the final array from the source. A O(N^2) implementation is easy to find, however I believe this can be done in O(N) or O(NlogN). Also if it is not possible to do better than O(N2) it will be great to know.
For example if the final array is [3,5,1,4,6] and the source array [1,5,3,4,6], the answer will be 3.
My O(N2) algorithm is like this: you loop over all the positions of the source array from the end, since we know that is the last element to move. Here it will be 6 and we check its position in the final array. We calculate the number of hops necessary and need to rearrange the final array to put that element in its original position in the source array. The rearranging step goes over all the elements in the array and the process loops over all the elements, hence O(N^2). Using Hashmap or map can help in searching, but the map needs to be updated after every step which makes in O(N^2).
P.S. I am trying to model correlation between two permutations in a Bayesian way and this is a sub-problem of that. Any ideas on modifying the generative process to make the problem simpler is also helpful.
Edit: I have found my answer. This is exactly what Kendall Tau distance does. There is an easy merge sort based algorithm to find this out in O(NlogN).
Consider the target array as an ordering. A target array [2 5 4 1 3] can be seen as [1 2 3 4 5], just by relabeling. You only have to know the mapping to be able to compare elements in constant time. On this instance, to compare 4 and 5 you check: index[4]=2 > index[5]=1 (in the target array) and so 4 > 5 (meaning: 4 must be to the right of 5 at the end).
So what you really have is just a vanilla sorting problem. The ordering is just different from the usual numerical ordering. The only thing that changes is your comparison function. The rest is basically the same. Sorting can be achieved in O(nlgn), or even O(n) (radix sort). That said, you have some additional constraints: you must sort in-place, and you can only swap two adjacent elements.
A strong and simple candidate would be selection sort, which will do just that in O(n^2) time. On each iteration, you identify the "leftiest" remaining element in the "unplaced" portion and swap it until it lands at the end of the "placed" portion. It can improve to O(nlgn) with the use of an appropriate data structure (priority queue for identifying the "leftiest" remaining element in O(lgn) time). Since nlgn is a lower bound for comparison based sorting, I really don't think you can do better than that.
Edit: So you're not interested in the sequence of swaps at all, only the minimum number of swaps required. This is exactly the number of inversions in the array (adapted to your particular needs: "non natural ordering" comparison function, but it doesn't change the maths). See this answer for a proof of that assertion.
One way to find the number of inversions is to adapt the Merge Sort algorithm. Since you have to actually sort the array to compute it, it turns out to be still O(nlgn) time. For an implementation, see this answer or this (again, remember that you'll have to adapt).
From your answer I assume number of hops is total number of swaps of adjacent elements needed to transform original array to final array.
I suggest to use something like insert sort, but without insertion part - data in arrays will not be altered.
You can make queue t for stalled hoppers as balanced binary search tree with counters (number of elements in subtree).
You can add element to t, remove element from t, balance t and find element position in t in O(log C) time, where C is the number of elements in t.
Few words on finding position of element. It consists of binary search with accumulation of skipped left sides (and middle elements +1 if you keep elements on branches) counts.
Few words on balancing/addition/removal. You have to traverse upward from removed/added element/subtree and update counters. Overall number of operations still hold at O(log C) for insert+balance and remove+balance.
Let's t is that (balanced search tree) queue, p is current original array index, q is final array index, original array is a, final array is f.
Now we have 1 loop starting from left side (say, p=0, q=0):
If a[p] == f[q], then original array element hops over the whole queue. Add t.count to the answer, increment p, increment q.
If a[p] != f[q] and f[q] is not in t, then insert a[p] into t and increment p.
If a[p] != f[q] and f[q] is in t, then add f[q]'s position in queue to answer, remove f[q] from t and increment q.
I like the magic that will ensure this process will move p and q to ends of arrays in the same time if arrays are really permutations of one array. Nevertheless you should probably check p and q overflows to detect incorrect data as we have no really faster way to prove data is correct.

Find all elements that appear more than n/4 times in linear time

This problem is 4-11 of Skiena. The solution to finding majority elements - repeated more than half times is majority algorithm. Can we use this to find all numbers repeated n/4 times?
Misra and Gries describe a couple approaches. I don't entirely understand their paper, but a key idea is to use a bag.
Boyer and Moore's original majority algorithm paper has a lot of incomprehensible proofs and discussion of formal verification of FORTRAN code, but it has a very good start of an explanation of how the majority algorithm works. The key concept starts with the idea that if the majority of the elements are A and you remove, one at a time, a copy of A and a copy of something else, then in the end you will have only copies of A. Next, it should be clear that removing two different items, neither of which is A, can only increase the majority that A holds. Therefore it's safe to remove any pair of items, as long as they're different. This idea can then be made concrete. Take the first item out of the list and stick it in a box. Take the next item out and stick it in the box. If they're the same, let them both sit there. If the new one is different, throw it away, along with an item from the box. Repeat until all items are either in the box or in the trash. Since the box is only allowed to have one kind of item at a time, it can be represented very efficiently as a pair (item type, count).
The generalization to find all items that may occur more than n/k times is simple, but explaining why it works is a little harder. The basic idea is that we can find and destroy groups of k distinct elements without changing anything. Why? If w > n/k then w-1 > (n-k)/k. That is, if we take away one of the popular elements, and we also take away k-1 other elements, then the popular element remains popular!
Implementation: instead of only allowing one kind of item in the box, allow k-1 of them. Whenever you see a group of k different items show up (that is, there are k-1 types in the box, and the one arriving doesn't match any of them), you throw one of each type in the trash, including the one that just arrived. What data structure should we use for this "box"? Well, a bag, of course! As Misra and Gries explain, if the elements can be ordered, a tree-based bag with O(log k) basic operations will give the whole algorithm a complexity of O(n log k). One point to note is that the operation of removing one of each element is a bit expensive (O(k) for a typical implementation), but that cost is amortized over the arrivals of those elements, so it's no big deal. Of course, if your elements are hashable rather than orderable, you can use a hash-based bag instead, which under certain common assumptions will give even better asymptotic performance (but it's not guaranteed). If your elements are drawn from a small finite set, you can guarantee that. If they can only be compared for equality, then your bag gets much more expensive and I'm pretty sure you end up with something like O(nk) instead.
Find the majority element that appears n/2 times by Moore-Voting Algorithm
See method 3 of the given link for Moore's Voting Algo (http://www.geeksforgeeks.org/majority-element/).
Time:O(n)
Now after finding majority element, scan the array again and remove the majority element or make it -1.
Time:O(n)
Now apply Moore Voting Algorithm on the remaining elements of array (but ignore -1 now as it has already been included earlier). The new majority element appears n/4 times.
Time:O(n)
Total Time:O(n)
Extra Space:O(1)
You can do it for element appearing more than n/8,n/16,.... times
EDIT:
There may exist a case when there is no majority element in the array:
For e.g. if the input arrays is {3, 1, 2, 2, 1, 2, 3, 3} then the output should be [2, 3].
Given an array of of size n and a number k, find all elements that appear more than n/k times
See this link for the answer:
https://stackoverflow.com/a/24642388/3714537
References:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
See this paper for a solution that uses constant memory and runs in linear time, which will find 3 candidates for elements that occur more than n/4 times. Note that if you assume that your data is given as a stream that you can only go through once, this is the best you can do -- you have to go through the stream one more time to test each of the 3 candidates to see if it occurs more than n/4 times in the stream. However, if you assume a priori that there are 3 elements that occur more than n/4 times then you only need to go through the stream once so you get a linear time online algorithm (only goes through the stream once) that only requires constant storage.
As you didnt mention space complexity , one possible solution is using hashtable for the elements which maps to count then you can just increment count if the element is found.

Find if element in array occurs more frequently than some value

I have the problem where given an array, A[1...n] of n (not necessarily distinct) integers, find an algorithm to determine whether any item occurs more than ceiling of n/4 times in A.
It seems that the best-possible worst case time is O(n). I am aware of the majority element algorithm and am wondering if this may be applied to this situation. Please let me know if you have any suggestions for approaching this problem.
This is only an idea of an algorithm but I believe it should be possible to make it work.
The main trick is as follows. If you look at any four elements and they are all different, you may throw all four away. (If any of the thrown elements had more than 1/4 frequency in the old array, it will in the new array; if none had, none will).
So you go over an array, throwing away tuples of four, and rearranging the rest. For instance, if you have AABC and then DDEF, you rearrange to AADDBCEF and throw BCEF away. I will let you work out the details, it's not hard. In the end you should be left with pairs of identical elements. Then you throw odd-numbered elements away and repeat.
After each run you may be left with 1, 2 or 3 elements with no pair that you cannot throw away. No worry, you can combine the leftovers of two runs such that there are never more than 3 elements in the leftover pile. E.g. if after run 1 you have A,B,C and after run 2 you have A,D,E you leave just A. Remember that elements from the second run count twice, so in effect you have 3"A, which is more than 1/4 of the total of 9. Keep count of each leftover element to track which of them can be thrown away. (You might be able to just always keep the last leftovers, I have not checked that).
In the end you will have just the leftovers. Check each one against the original array.
There are three possibilities for this element(s), either it's a median of array, or is a median of n/2 smallest elements of array or it's a median of n/2 largest elements of array.
In all cases first find the median of array. After that check whether it occurs in at least n/4 elements, if not then divide array into two part of almost same size (they differ in size by at most one element):
smaller than equal to median
bigger than equal to median
Then find the median of each of these two subarrays and check the number of occurrences of each of them. This is in all O(n). Also in this way you can find all elements with occurrence at least n/4 times.
By the way you can extend this technique to find any element with O(n) time occurrence (e.g n/10000), which works again in O(n).

Finding Median in Three Sorted Arrays in O(logn)

By googling for minutes, I know the basic idea.
Let A,B,and C be sorted arrays containing n elements.
Pick median in each array and call them medA, medB, and medC.
Without loss of generality, suppose that medA > medB > medC.
The elements bigger than medA in array A cannot become the median of three arrays. Likewise, the elements smaller than medC in array C cannot, so such elements will be ignored.
Repeat steps 2-4 recursively.
My question is, what is the base case?
Assuming a lot of base cases, I tested the algorithm by hands for hours, but I was not able to find a correct base case.
Also, the lengths of three arrays will become different every recursive step. Does step 4 work even if the length of three arrays are different?
This algorithm works for two sorted arrays of same sizes but not three. After the one iteration, you eliminates half of the elements in A and C but leaves B unchanged, so the number of elements in these arrays are no longer the same, and the method no longer apply. For arrays of different sizes, if you apply the same method, you will be removing different number of elements from the lower half and upper half, therefore the median of the remaining elements is not the same as the median of the original arrays.
That being said, you can modify the algorithm to eliminate same number of elements at both end in each iteration, this could be in efficient when some of the arrays are very small and some are very large. You can also turn this into a question of finding the k-th element, track the number of elements being throw away and change value of k at each iteration. Either way this is much trickier than the two array situation.
There is another post talking about a general case: Median of 5 sorted arrays
I think you can use the selection algorithm, slightly modified to handle more arrays.
You're looking for the median, which is the p=[n/2]th element.
Pick the median of the largest array, find for that value the splitting point in the other two arrays (binary search, log(n)). Now you know that the selected number is the kth (k = sum of the positions).
If k > p, discard elements in the 3 arrays above it, if smaller, below it (discarding can be implemented by maintaing lower and upper indexes for each array, separately). If it was smaller, also update p = p - k.
Repeat until k=p.
Oops, I think this is log(n)^2, let me think about it...

Resources