How can I find a number which occurs an odd number of times in a SORTED array in O(n) time? - c

I have a question and I tried to think over it again and again... but got nothing so posting the question here. Maybe I could get some view-point of others, to try and make it work...
The question is: we are given a SORTED array, which consists of a collection of values occurring an EVEN number of times, except one, which occurs an ODD number of times. We need to find that value in O(log n) time.
It is easy to find the solution in O(n) time, but it looks pretty tricky to do in O(log n) time.

Theorem: Every deterministic algorithm for this problem probes Ω(log^2 n) memory locations in the worst case.
Proof (completely rewritten in a more formal style):
Let k > 0 be an odd integer and let n = k^2. We describe an adversary that forces (log_2 (k + 1))^2 = Ω(log^2 n) probes.
We call the maximal subsequences of identical elements groups. The adversary's possible inputs consist of k length-k segments x_1 x_2 … x_k. For each segment x_j, there exists an integer b_j ∈ [0, k] such that x_j consists of b_j copies of j - 1 followed by k - b_j copies of j. Each group overlaps at most two segments, and each segment overlaps at most two groups.
Group boundaries
|   |     |   |   |
 0 0 1 1 1 2 2 3 3
|     |     |     |
Segment boundaries
Wherever there is an increase of two, we assume a double boundary by convention.
Group boundaries
|     ||      |   |
 0 0 0 2 2 2 2 3 3
Claim: The location of the jth group boundary (1 ≤ j ≤ k) is uniquely determined by the segment x_j.
Proof: It's just after the ((j - 1) k + b_j)th memory location, and x_j uniquely determines b_j. //
We say that the algorithm has observed the jth group boundary in case the results of its probes of x_j uniquely determine x_j. By convention, the beginning and the end of the input are always observed. It is possible for the algorithm to uniquely determine the location of a group boundary without observing it.
Group boundaries
|   X   |   |     |
 0 0 ? 1 2 2 3 3 3
|     |     |     |
Segment boundaries
Given only 0 0 ?, the algorithm cannot tell for sure whether ? is a 0 or a 1. In context, however, ? must be a 1, as otherwise there would be three odd groups, and the group boundary at X can be inferred. These inferences could be problematic for the adversary, but it turns out that they can be made only after the group boundary in question is "irrelevant".
Claim: At any given point during the algorithm's execution, consider the set of group boundaries that it has observed. Exactly one consecutive pair is at odd distance, and the odd group lies between them.
Proof: Every other consecutive pair bounds only even groups. //
Define the odd-length subsequence bounded by the special consecutive pair to be the relevant subsequence.
Claim: No group boundary in the interior of the relevant subsequence is uniquely determined. If there is at least one such boundary, then the identity of the odd group is not uniquely determined.
Proof: Without loss of generality, assume that each memory location not in the relevant subsequence has been probed and that each segment contained in the relevant subsequence has exactly one location that has not been probed. Suppose that the jth group boundary (call it B) lies in the interior of the relevant subsequence. By hypothesis, the probes to x_j determine B's location up to two consecutive possibilities. We call the one at odd distance from the left observed boundary odd-left and the other odd-right. For both possibilities, we work left to right and fix the location of every remaining interior group boundary so that the group to its left is even. (We can do this because they each have two consecutive possibilities as well.) If B is at odd-left, then the group to its left is the unique odd group. If B is at odd-right, then the last group in the relevant subsequence is the unique odd group. Both are valid inputs, so the algorithm has uniquely determined neither the location of B nor the odd group. //
Example:
Observed group boundaries; relevant subsequence marked by […]
[             ]   |
 0 0 Y 1 1 Z 2 3 3
|     |     |     |
Segment boundaries
Possibility #1: Y=0, Z=2
Possibility #2: Y=1, Z=2
Possibility #3: Y=1, Z=1
As a consequence of this claim, the algorithm, regardless of how it works, must narrow the relevant subsequence to one group. By definition, it therefore must observe some group boundaries. The adversary now has the simple task of keeping open as many possibilities as it can.
At any given point during the algorithm's execution, the adversary is internally committed to one possibility for each memory location outside of the relevant subsequence. At the beginning, the relevant subsequence is the entire input, so there are no initial commitments. Whenever the algorithm probes an uncommitted location of x_j, the adversary must commit to one of two values: j - 1, or j. If it can avoid letting the jth boundary be observed, it chooses a value that leaves at least half of the remaining possibilities (with respect to observation). Otherwise, it chooses so as to keep at least half of the groups in the relevant interval and commits values for the others.
In this way, the adversary forces the algorithm to observe at least log_2 (k + 1) group boundaries, and in observing the jth group boundary, the algorithm is forced to make at least log_2 (k + 1) probes.
Extensions:
This result extends straightforwardly to randomized algorithms by randomizing the input, replacing "at best halved" (from the algorithm's point of view) with "at best halved in expectation", and applying standard concentration inequalities.
It also extends to the case where no group can be larger than s copies; in this case the lower bound is Ω(log n log s).

A sorted array suggests a binary search. We have to redefine equality and comparison. Equality simply means an odd number of elements. We can do comparison by observing the index of the first or last element of the group. The first element will be at an even index (0-based) before the odd group, and at an odd index after the odd group. We can find the first and last elements of a group using binary search. The total cost is O((log N)²).
PROOF OF O((log N)²)
T(2) = 1 //to make the summation nice
T(N) = log(N) + T(N/2) //log(N) is finding the first/last elements
For some N=2^k,
T(2^k) = (log 2^k) + T(2^(k-1))
= (log 2^k) + (log 2^(k-1)) + T(2^(k-2))
= (log 2^k) + (log 2^(k-1)) + (log 2^(k-2)) + ... + (log 2^2) + 1
= k + (k-1) + (k-2) + ... + 1
= k(k+1)/2
= (k² + k)/2
= (log(N)² + log(N))/ 2
= O(log(N)²)
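As a rough illustration of the approach described in this answer, here is a minimal Python sketch. The function name find_odd_group is made up for this example, and it leans on the standard bisect module to find the first and last element of a group. It assumes the input is sorted and contains exactly one group of odd size.

```python
from bisect import bisect_left, bisect_right


def find_odd_group(arr):
    """Return the value that occurs an odd number of times in sorted arr."""
    # Invariant: arr[lo:hi] starts and ends on group boundaries, has odd
    # length, and therefore contains the odd group.
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        first = bisect_left(arr, arr[mid], lo, hi)   # first index of the group
        last = bisect_right(arr, arr[mid], lo, hi)   # one past the last index
        if (last - first) % 2 == 1:
            return arr[mid]                          # this group is the odd one
        if (first - lo) % 2 == 1:
            hi = first    # odd number of elements on the left: go left
        else:
            lo = last     # otherwise the odd group is on the right
    raise ValueError("no value occurs an odd number of times")
```

Each loop iteration does two binary searches and at least halves the window, which matches the O((log N)²) bound derived above.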

Look at the middle element of the array. With a couple of appropriate binary searches, you can find its first and last appearance in the array. E.g., if the middle element is 'a', you need to find i and j as shown below:
[* * * * a a a a * * *]
         ^     ^
         |     |
         |     |
         i     j
Is j - i an even number? Then 'a' occurs an odd number of times and you are done! Otherwise (and this is the key here), the question to ask is: is i an even or an odd number? Do you see what this piece of knowledge implies? Then the rest is easy.

This answer is in support of the answer posted by "throwawayacct". He deserves the bounty. I spent some time on this question and I'm totally convinced that his proof is correct that you need Ω(log(n)^2) queries to find the number that occurs an odd number of times. I'm convinced because I ended up recreating the exact same argument after only skimming his solution.
In the solution, an adversary creates an input to make life hard for the algorithm, but also simple for a human analyzer. The input consists of k pages that each have k entries. The total number of entries is n = k^2, and it is important that O(log(k)) = O(log(n)) and Ω(log(k)) = Ω(log(n)). To make the input, the adversary makes a string of length k of the form 00...011...1, with the transition in an arbitrary position. Then each symbol in the string is expanded into a page of length k of the form aa...abb...b, where on the ith page, a=i and b=i+1. The transition on each page is also in an arbitrary position, except that the parity agrees with the symbol that the page was expanded from.
It is important to understand the "adversary method" of analyzing an algorithm's worst case. The adversary answers queries about the algorithm's input, without committing to future answers. The answers have to be consistent, and the game is over when the adversary has been pinned down enough for the algorithm to reach a conclusion.
With that background, here are some observations:
1) If you want to learn the parity of a transition in a page by making queries in that page, you have to learn the exact position of the transition and you need Ω(log(k)) queries. Any collection of queries restricts the transition point to an interval, and any interval of length more than 1 has both parities. The most efficient search for the transition in that page is a binary search.
2) The most subtle and most important point: There are two ways to determine the parity of a transition inside a specific page. You can either make enough queries in that page to find the transition, or you can infer the parity if you find the same parity in both an earlier and a later page. There is no escape from this either-or. Any set of queries restricts the transition point in each page to some interval. The only restriction on parities comes from intervals of length 1. Otherwise the transition points are free to wiggle to have any consistent parities.
3) In the adversary method, there are no lucky strikes. For instance, suppose that your first query in some page is toward one end instead of in the middle. Since the adversary hasn't committed to an answer, he's free to put the transition on the long side.
4) The end result is that you are forced to directly probe the parities in Ω(log(k)) pages, and the work for each of these subproblems is also Ω(log(k)).
5) Things are not much better with random choices than with adversarial choices. The math is more complicated, because now you can get partial statistical information, rather than a strict yes you know a parity or no you don't know it. But it makes little difference. For instance, you can give each page length k^2, so that with high probability, the first log(k) queries in each page tell you almost nothing about the parity in that page. The adversary can make random choices at the beginning and it still works.

Start at the middle of the array and walk backward until you get to a value that's different from the one at the center. Check whether the number above that boundary is at an odd or even index. If it's odd, then the number occurring an odd number of times is to the left, so repeat your search between the beginning and the boundary you found. If it's even, then the number occurring an odd number of times must be later in the array, so repeat the search in the right half.
As stated, this has both a logarithmic and a linear component. If you want to keep the whole thing logarithmic, instead of just walking backward through the array to a different value, you want to use a binary search instead. Unless you expect many repetitions of the same numbers, the binary search may not be worthwhile though.

I have an algorithm which works in log(N/C)*log(K), where K is the length of maximum same-value range, and C is the length of range being searched for.
The main difference of this algorithm from most posted before is that it takes advantage of the case where all same-value ranges are short. It finds boundaries not by binary-searching the entire array, but by first quickly finding a rough estimate by jumping back by 1, 2, 4, 8, ... (log(K) iterations) steps, and then binary-searching the resulting range (log(K) again).
The algorithm is as follows (written in C#):
// Finds the start of the range of equal numbers containing the index "index",
// which is assumed to be inside the array
//
// Complexity is O(log(K)) with K being the length of range
static int findRangeStart (int[] arr, int index)
{
    int candidate = index;
    int value = arr[index];
    int step = 1;
    // find the boundary for binary search:
    while(candidate>=0 && arr[candidate] == value)
    {
        candidate -= step;
        step *= 2;
    }
    // binary search: arr[b] == value always holds, and "a" is either an index
    // whose element differs from value or -1 (just before the array), so a
    // range that starts at index 0 is handled correctly:
    int a = Math.Max(-1,candidate);
    int b = candidate+step/2;
    while(a+1!=b)
    {
        int c = (a+b)/2;
        if(arr[c] == value)
            b = c;
        else
            a = c;
    }
    return b;
}
// Finds the index after the only "odd" range of equal numbers in the array.
// The result should be in the range (start; end]
// The "end" is considered to always be the end of some equal number range.
static int search(int[] arr, int start, int end)
{
    if(arr[start] == arr[end-1])
        return end;
    int middle = (start+end)/2;
    int rangeStart = findRangeStart(arr,middle);
    if((rangeStart & 1) == 0)
        return search(arr, middle, end);
    return search(arr, start, rangeStart);
}
// Finds the index after the only "odd" range of equal numbers in the array
static int search(int[] arr)
{
    return search(arr, 0, arr.Length);
}

Take the middle element e. Use binary search to find the first and last occurrence. O(log(n))
If it is odd return e.
Otherwise, recurse onto the side that has an odd number of elements [....]eeee[....]
Runtime will be log(n) + log(n/2) + log(n/4).... = O(log(n)^2).

AHhh. There is an answer.
Do a binary search and as you search, for each value, move backwards until you find the first entry with that same value. If its index is even, it is before the oddball, so move to the right.
If its array index is odd, it is after the oddball, so move to the left.
In pseudocode (this is the general idea, not tested...):
private static int FindOddBall(int[] ary)
{
    int l = 0,
        r = ary.Length - 1;
    int n = (l+r)/2;
    while (r > l+2)
    {
        n = (l + r) / 2;
        while (ary[n] == ary[n-1])
            n = FindBreakIndex(ary, l, n);
        if (n % 2 == 0) // even index we are on or to the left of the oddball
            l = n;
        else // odd index we are to the right of the oddball
            r = n-1;
    }
    return ary[l];
}
private static int FindBreakIndex(int[] ary, int l, int n)
{
    var t = ary[n];
    var r = n;
    while(ary[n] != t || ary[n] == ary[n-1])
        if(ary[n] == t)
        {
            r = n;
            n = (l + r)/2;
        }
        else
        {
            l = n;
            n = (l + r)/2;
        }
    return n;
}

You can use this algorithm:
int GetSpecialOne(int[] array, int length)
{
    int specialOne = array[0];
    for(int i=1; i < length; i++)
    {
        specialOne ^= array[i];
    }
    return specialOne;
}
Solved with the help of a similar question which can be found here on http://www.technicalinterviewquestions.net

We don't have any information about the distribution of lengths inside the array, and of the array as a whole, right?
So the array length might be 1, 11, 101, 1001 or something, 1 at least with no upper bound, and must contain at least 1 type of elements ('number') up to (length-1)/2 + 1 elements, for total sizes of 1, 11, 101: 1, 1 to 6, 1 to 51 elements and so on.
Shall we assume every possible size of equal probability? This would lead to a middle length of subarrays of size/4, wouldn't it?
An array of size 5 could be divided into 1, 2 or 3 sublists.
What seems to be obvious is not that obvious, if we go into details.
An array of size 5 can be 'divided' into one sublist in just one way, with arguable right to call it 'dividing'. It's just a list of 5 elements (aaaaa). To avoid confusion let's assume the elements inside the list to be ordered characters, not numbers (a,b,c, ...).
Divided into two sublist, they might be (1, 4), (2, 3), (3, 2), (4, 1). (abbbb, aabbb, aaabb, aaaab).
Now let's look back at the claim made before: Shall the 'division' (5) be assumed the same probability as those 4 divisions into 2 sublists? Or shall we mix them together, and assume every partition as evenly probable, (1/5)?
Or can we calculate the solution without knowing the probability of the length of the sublists?

The clue is you're looking for log(n). That's less than n.
Stepping through the entire array, one at a time? That's n. That's not going to work.
We know the first two indexes in the array (0 and 1) should be the same number. Same with 50 and 51, if the odd number in the array is after them.
So find the middle element in the array, compare it to the element right after it. If the change in numbers happens on the wrong index, we know the odd number in the array is before it; otherwise, it's after. With one set of comparisons, we figure out which half of the array the target is in.
Keep going from there.

Use a hash table
For each element E in the input set
    if E is set in the hash table
        increment its value
    else
        set E in the hash table and initialize it to 1
For each key K in hash table
    if the value stored for K is odd
        return K
As this algorithm is 2n it belongs to O(n)
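The pseudocode above could look roughly like this in Python. This is only a sketch: the function name find_odd_occurrence is invented for the example, and collections.Counter plays the role of the hash table.

```python
from collections import Counter


def find_odd_occurrence(values):
    """Return a value that occurs an odd number of times, or None."""
    counts = Counter(values)                 # first pass: value -> occurrences
    for value, count in counts.items():      # second pass over distinct values
        if count % 2 == 1:
            return value
    return None
```

Note that this does not use the fact that the array is sorted, and it needs O(n) extra space in the worst case.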

Try this:
int getOddOccurrence(int ar[], int ar_size)
{
    int i;
    int res = 0;
    for (i=0; i < ar_size; i++)
        res = res ^ ar[i];
    return res;
}
XOR will cancel out every time you XOR with the same number, so 1^1=0 but 1^1^1=1; every pair should cancel out, leaving only the number that occurs an odd number of times.

Assume indexing starts at 0. Binary search for the smallest even i such that x[i] != x[i+1]; your answer is x[i].
edit: due to public demand, here is the code
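/* min is the first index and max is the (odd) number of elements.
 * The search assumes that "x[2*i] != x[2*i+1]" stays true once it
 * becomes true, which holds when every other value occurs exactly twice. */
int f(int *x, int min, int max) {
    int lo = min / 2;   /* index of the first pair */
    int hi = max / 2;   /* index of the last slot, which has no partner */
    while (lo < hi) {
        int i = (lo + hi)/2;
        if (x[2*i] == x[2*i+1])
            lo = i+1;   /* pair is intact: the answer is to the right */
        else
            hi = i;     /* pair is broken: the answer is here or to the left */
    }
    return x[2*lo];
}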

Related

Given an array of integers of size n+1 consisting of the elements [1,n]. All elements are unique except one which is duplicated k times

I have been attempting to solve the following problem:
You are given an array of n+1 integers where all the elements lie in [1,n]. You are also given that one of the elements is duplicated a certain number of times, whilst the others are distinct. Develop an algorithm to find both the duplicated number and the number of times it is duplicated.
Here is my solution where I let k = number of duplications:
#include <vector>
struct LatticePoint{ // to hold duplicate and k
    int a;
    int b;
    LatticePoint(int a_, int b_) : a(a_), b(b_) {}
};
LatticePoint findDuplicateAndK(const std::vector<int>& A){
    int n = A.size() - 1;
    std::vector<int> Numbers (n);
    for(int i = 0; i < n + 1; ++i){
        ++Numbers[A[i] - 1]; // A[i] is in range [1,n], so the index stays in bounds
    }
    for(int i = 0; i < n; ++i){
        if(Numbers[i] > 1) {
            int duplicate = i + 1;
            int k = Numbers[i] - 1;
            return LatticePoint{duplicate, k};
        }
    }
    return LatticePoint{0, 0}; // not reached for a valid input
}
So, the basic idea is this: we go along the array and each time we see the number A[i] we increment the value of Numbers[A[i] - 1]. Since only the duplicate appears more than once, the entry of Numbers with value greater than 1 sits at index duplicate - 1, and its value minus 1 is the number of duplications k. This algorithm is O(n) in time and O(n) in space.
I was wondering if someone had a solution that is better in time and/or space? (or indeed if there are any errors in my solution...)
You can reduce the scratch space to n bits instead of n ints, provided you either have or are willing to write a bitset with run-time specified size (see boost::dynamic_bitset).
You don't need to collect duplicate counts until you know which element is duplicated, and then you only need to keep that count. So all you need to track is whether you have previously seen the value (hence, n bits). Once you find the duplicated value, set count to 2 and run through the rest of the vector, incrementing count each time you hit an instance of the value. (You initialise count to 2, since by the time you get there, you will have seen exactly two of them.)
That's still O(n) space, but the constant factor is a lot smaller.
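Here is a small Python sketch of that idea, under the assumption that a bytearray used as a bit field is an acceptable stand-in for a real bitset such as boost::dynamic_bitset. The function name is made up, and it returns the total number of occurrences (the count described above), not the question's k.

```python
def find_duplicate_and_count(a):
    """a holds n+1 integers in [1, n]; return (duplicate, occurrences)."""
    n = len(a) - 1
    seen = bytearray((n + 8) // 8)    # one bit per value 0..n (bit 0 unused)
    duplicate, count = None, 0
    for v in a:
        if duplicate is None:
            byte, bit = divmod(v, 8)
            if (seen[byte] >> bit) & 1:
                # second sighting of v: it must be the duplicated value
                duplicate, count = v, 2
            else:
                seen[byte] |= 1 << bit
        elif v == duplicate:
            count += 1                # only this one counter is needed now
    return duplicate, count
```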
The idea of your code works.
But, thanks to the n+1 elements, we can achieve other tradeoffs of time and space.
If we have some number of buckets we're dividing numbers between, putting n+1 numbers in means that some bucket has to wind up with more than expected. This is a variant on the well-known pigeonhole principle.
So we use 2 buckets, one for the range 1..floor(n/2) and one for floor(n/2)+1..n. After one pass through the array, we know which half the answer is in. We then divide that half into halves, make another pass, and so on. This leads to a binary search which will get the answer with O(1) data, and with ceil(log_2(n)) passes, each taking time O(n). Therefore we get the answer in time O(n log(n)).
Now we don't need to use 2 buckets. If we used 3, we'd take ceil(log_3(n)) passes. So as we increased the fixed number of buckets, we take more space and save time. Are there other tradeoffs?
Well, you showed how to do it in 1 pass with n buckets. How many buckets do you need to do it in 2 passes? The answer turns out to be at least sqrt(n) buckets. And 3 passes is possible with the cube root. And so on.
So you get a whole family of tradeoffs where the more buckets you have, the more space you need, but the fewer passes. And your solution is merely at the extreme end, taking the most spaces and the least time.
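To make the two-bucket end of that family concrete, here is a hedged Python sketch. The function name is invented for the example. It only keeps a counter for the lower half of the current value range, since the upper half follows by the pigeonhole argument: if the lower half holds more elements than it has distinct values, the duplicate must be in it.

```python
def find_duplicate_by_halving(a):
    """a holds n+1 integers in [1, n]; return (duplicate, occurrences)."""
    n = len(a) - 1
    lo, hi = 1, n                     # the duplicate lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        # One pass: how many elements fall in the lower bucket [lo, mid]?
        in_lower = sum(1 for v in a if lo <= v <= mid)
        if in_lower > mid - lo + 1:
            hi = mid                  # more elements than values: it is here
        else:
            lo = mid + 1              # otherwise the upper bucket overflows
    occurrences = sum(1 for v in a if v == lo)
    return lo, occurrences
```

This makes about log_2(n) passes plus one final counting pass, using O(1) extra space.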
Here's a cheekier algorithm, which requires only constant space but rearranges the input vector. (It only reorders; all the original elements are still present at the end.)
It's still O(n) time, although that might not be completely obvious.
The idea is to try to rearrange the array so that A[i] is i, until we find the duplicate. The duplicate will show up when we try to put an element at the right index and it turns out that that index already holds that element. With that, we've found the duplicate; we have a value we want to move to A[j] but the same value is already at A[j]. We then scan through the rest of the array, incrementing the count every time we find another instance.
#include <utility>
#include <vector>
std::pair<int, int> count_dup(std::vector<int> A) {
    /* Try to put each element in its "home" position (that is,
     * where the value is the same as the index). Since the
     * values start at 1, A[0] isn't home to anyone, so we start
     * the loop at 1.
     */
    int n = A.size();
    for (int i = 1; i < n; ++i) {
        while (A[i] != i) {
            int j = A[i];
            if (A[j] == j) {
                /* j is the duplicate. Now we need to count them.
                 * We have one at i. There's one at j, too, but we only
                 * need to add it if we're not going to run into it in
                 * the scan. And there might be one at position 0. After that,
                 * we just scan through the rest of the array.
                 */
                int count = 1;
                if (A[0] == j) ++count;
                if (j < i) ++count;
                for (++i; i < n; ++i) {
                    if (A[i] == j) ++count;
                }
                return std::make_pair(j, count);
            }
            /* This swap can only happen once per element. */
            std::swap(A[i], A[j]);
        }
    }
    /* If we get here, every element from 1 to n is at home.
     * So the duplicate must be A[0], and the duplicate count
     * must be 2.
     */
    return std::make_pair(A[0], 2);
}
A parallel solution with O(1) complexity is possible.
Introduce an array of atomic booleans and two atomic integers called duplicate and count. First set count to 1. Then access the array in parallel at the index positions of the numbers and perform a test-and-set operation on the boolean. If a boolean is set already, assign the number to duplicate and increment count.
This solution may not always perform better than the suggested sequential alternatives. Certainly not if all numbers are duplicates. Still, it has constant complexity in theory. Or maybe linear complexity in the number of duplicates. I am not quite sure. However, it should perform well when using many cores and especially if the test-and-set and increment operations are lock-free.

Find the element occurring once in an array where all other elements occur twice (without using XOR)

I have tried solving this for so long but I can't seem to be able to.
The question is as follows:
Given an array of n numbers where all of the numbers in it occur twice except for one, which occurs only once, find the number that occurs only once.
Now, I have found many solutions online for this, but none of them satisfy the additional constraints of the question.
The solution should:
Run in linear time (aka O(n)).
Not use hash tables.
Assume that computer supports only comparison and the arithmetic (addition, subtraction, multiplication, division).
The number of bits in each number in the array is about O(log(n)).
Therefore, trying something like this https://stackoverflow.com/a/4772568/7774315 using the XOR operator isn't possible, since we don't have the XOR operator. Since the number of bits in each number is about O(log(n)), trying to implement the XOR operator using normal arithmetic (bit by bit) will take about O(log(n)) actions, which will give us an overall solution of O(nlog(n)).
The closest I have come to solving it is if I had a way to get the sum of all unique values in the array in linear time, I could subtract twice that sum from the overall sum to get (negative) the element that occurs only once, because if the numbers that appear twice are {a1,a2,....,ak} and the number that appears once is x, then the overall sum is
sum=2(a1+...+ak)+x
As far as I know, sets are implemented using hash tables, so using them to find the sum of all unique values is no good.
Let's imagine we had a way to find the exact median in linear time and partition the array so all greater elements are on one side and smaller elements on the other. By the parity of expected number of elements, we could identify which side the target element is in. Now perform this routine recursively in the section we identified. Since the section is halved in size each time, the total number of elements traversed cannot exceed O(2n) = O(n).
The key element in the question seems to be this one:
The number of bits in each number in the array is about O(log(n)).
The issue is that this clue is a little vague.
A first approach is to consider that the maximum value is O(n). Then a counting sort can be performed in O(n) operations and O(n) memory.
It consists of finding the maximum value MAX, allocating an integer array C of size MAX + 1, and performing a classical counting sort directly with it:
C[a[i]]++;
Looking for an odd value in array C[] will provide the solution.
A second approach, which I guess is more efficient, would be to set up an array of size n, each element of which is an array of unknown size. Then, a kind of almost-counting sort would consist of:
C[a[i]%n].append (a[i]);
To find the unique element, we then have to find a sub-array of odd size, and then to examine the elements in this sub-array.
The maximum size k of each sub-array will be about 2*(MAX/n). According to the clue, this value should be very low. Dealing with this sub-array has complexity O(k), for example by performing a counting sort on b[j]/n for the elements b[j] of the sub-array, all of which are equal modulo n.
We can note that, in practice, this is equivalent to performing a kind of ad hoc hashing.
Global complexity is O(n + MAX/n).
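A minimal Python sketch of the second approach might look like this. The function name is invented, and for brevity the odd-sized sub-array is sorted rather than counting-sorted on b[j]/n as suggested above. That is a simplification for the example, and it is harmless only as long as the sub-arrays really are small.

```python
def find_single_by_buckets(a):
    """Every value in a occurs twice except one; return that value."""
    n = len(a)
    buckets = [[] for _ in range(n)]
    for v in a:
        buckets[v % n].append(v)      # equal values always share a bucket
    for bucket in buckets:
        if len(bucket) % 2 == 1:      # exactly one bucket has odd size
            bucket.sort()             # stand-in for a counting sort on v // n
            for i in range(0, len(bucket) - 1, 2):
                if bucket[i] != bucket[i + 1]:
                    return bucket[i]
            return bucket[-1]
    return None
```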
This should do the trick as long as you are dealing with integers of size O(log n). It is a Python implementation of the algorithm sketched in גלעד ברקן's answer (including OneLyner's comments), where the median is replaced by a mean or mid-value.
def mean(items):
    result = 0
    for i, item in enumerate(items, 1):
        result = (result * (i - 1) + item) / i
    return result
def midval(items):
    min_val = max_val = items[0]
    for item in items:
        if item < min_val:
            min_val = item
        elif item > max_val:
            max_val = item
    # midpoint of the value range
    return (max_val + min_val) / 2
def find_singleton(items, pivoting=mean):
    n = len(items)
    if n == 1:
        return items[0]
    else:
        # find pivot - O(n)
        pivot = pivoting(items)
        # partition the items - O(n)
        j = 0
        for i, item in enumerate(items):
            if item > pivot:
                items[j], items[i] = items[i], items[j]
                j += 1
        # recursion on the partition with odd number of elements,
        # keeping the same pivoting strategy
        if j % 2:
            return find_singleton(items[:j], pivoting)
        else:
            return find_singleton(items[j:], pivoting)
The following code is just for some sanity-checking on random inputs:
import random
def gen_input(n, randomize=True):
    """Generate an odd-length input of pairs plus one singleton (at most 2 * n - 1 items)."""
    items = sorted(set(random.randint(-n, n) for _ in range(n)))[:n]
    singleton = items[-1]
    items = items + items[:-1]
    if randomize:
        random.shuffle(items)
    return items, singleton
items, singleton = gen_input(100)
print(singleton, len(items), items.index(singleton), items)
print(find_singleton(items, mean))
print(find_singleton(items, midval))
For a symmetric distribution the median and the mean or mid-value coincide.
With the log(n) requirement on the number of bits for the entries, one
can show that any arbitrary sub-sampling cannot be skewed enough to provide more than log(n) recursions.
For example, considering the case of k = log(n) bits with k = 4 and only positive numbers, the worst case is: [0, 1, 1, 2, 2, 4, 4, 8, 8, 16, 16]. Here pivoting by the mean will reduce the input by 2 elements at a time, resulting in k + 1 recursive calls; adding any other pair to the input will not increase the number of recursive calls, but it will increase the input size.
(EDITED to provide a better explanation.)
Here is an (unoptimized) implementation of the idea sketched by גלעד ברקן .
I'm using Median_of_medians to get a value close enough to the median to ensure the linear time in the worst case.
NB: this in fact uses only comparisons, and is O(n) whatever the size of the integers as long as comparisons and copies are counted as O(1).
def median_small(L):
    return sorted(L)[len(L)//2]
def median_of_medians(L):
    if len(L) < 20:
        return median_small(L)
    return median_of_medians([median_small(L[i:i+5]) for i in range(0, len(L), 5)])
def find_single(L):
    if len(L) == 1:
        return L[0]
    pivot = median_of_medians(L)
    smaller = [i for i in L if i < pivot]
    equal = [i for i in L if i == pivot]
    bigger = [i for i in L if i > pivot]
    if len(equal) % 2:
        # the pivot itself is the unpaired element
        return pivot
    # the pivot is paired, so both sides shrink and the recursion terminates
    if len(smaller) % 2:
        return find_single(smaller)
    else:
        return find_single(bigger)
This version needs O(n) additional space, but could be implemented with O(1).

Find a unique integer in an array

I am looking for an algorithm to solve the following problem: We are given an integer array of size n in which k (0 < k < n) elements occur exactly once. Every other integer occurs an even number of times in the array. The output should be any of the k unique numbers. k is a fixed number and not part of the input.
An example would be the input [1, 2, 2, 4, 4, 2, 2, 3] with both 1 and 3 being a correct output.
Most importantly, the algorithm should run in O(n) time and require only O(1) additional space.
edit: There has been some confusion regarding whether there is only one unique integer or multiple. I apologize for this. The correct problem is that there is an arbitrary but fixed amount. I have updated the original question above.
"Dante." gave a good answer for the case that there are at most two such numbers. This link also provides a solution for three. "David Eisenstat" commented that it is also possible to do for any fixed k. I would be grateful for a solution.
There is a standard algorithm to solve such problems using XOR operator:
Time Complexity = O(n)
Space Complexity = O(1)
Suppose your input array contains only one element that occurs an odd number of times, and the rest occur an even number of times. We take advantage of the following fact:
Any expression with an even number of 0's and an even number of 1's, in any order, always evaluates to 0 under XOR.
That is
0^1^....... = 0 as long as the number of 0's is even and the number of 1's is even
and 0 and 1 can occur in any order.
Every number that occurs an even number of times contributes an even number of 1's and 0's at each bit position, so those bits cancel. Only the number that occurs once has its bits left over when we take the XOR of all elements of the array, because
0 (from numbers occurring an even number of times) ^ 1 (from the number occurring once) = 1
0 (from numbers occurring an even number of times) ^ 0 (from the number occurring once) = 0
As you can see, only the bits of the number occurring once are preserved.
This means that when you take the XOR of all the elements of such an array, the result is the number which occurs only once.
So the algorithm for array of length n is:
result = array[0]^array[1]^.....array[n-1]
Different Scenario
As the OP mentioned, the input can also be an array in which two numbers occur an odd number of times and the rest occur an even number of times.
This is solved using the same logic as above, with a small difference.
Idea of the algorithm:
If you take the XOR of all the elements, the bits of the elements occurring an even number of times cancel to 0, which means:
The result has a 1 bit only at the positions where the two odd-count numbers differ.
We will use the above idea.
Now we pick one bit of the resulting XOR that is 1 (any such bit will do) and set the rest to 0. The result is a mask that lets us tell the two required numbers apart.
Because that bit is 1, the two numbers differ at this position: one has 0 there and the other has 1. So ANDing with the mask gives 0 for one of them and a nonzero value for the other.
The rightmost set bit is the easiest one to isolate, so we extract it from the XOR result as
A = result & ~(result-1)
Now traverse the array once, and if array[i]&A is 0, fold the number into the variable number_1 as
number_1 = number_1^array[i]
otherwise
number_2 = number_2^array[i]
Because the remaining numbers occur an even number of times, their bits cancel automatically.
So the algorithm is:
1. Take the XOR of all elements; call it xor.
2. Isolate the rightmost set bit of xor and store it in B, i.e. B = xor & ~(xor-1).
3. Do the following:
number_1 = 0, number_2 = 0;
for (i = 0 to n-1)
{
    if (array[i] & B)
        number_1 = number_1 ^ array[i];
    else
        number_2 = number_2 ^ array[i];
}
number_1 and number_2 are the required numbers.
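The three steps can be sketched as runnable Python. This is an illustration under the assumption that the inputs are non-negative integers and that exactly two values have an odd count; the function name find_two_odd is hypothetical.

```python
def find_two_odd(values):
    """Return the two values that occur an odd number of times."""
    # Step 1: XOR of everything equals first ^ second.
    total = 0
    for v in values:
        total ^= v

    # Step 2: isolate the rightmost set bit (same as total & ~(total - 1)).
    low_bit = total & -total

    # Step 3: split the input by that bit and XOR each half separately.
    number_1 = 0
    number_2 = 0
    for v in values:
        if v & low_bit:
            number_1 ^= v
        else:
            number_2 ^= v
    return number_1, number_2


if __name__ == "__main__":
    print(find_two_odd([1, 2, 3, 2, 1, 5]))
```

The two odd-count values land in different halves because they differ at the chosen bit, while every even-count value falls entirely into one half and cancels there.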
Here's a Las Vegas algorithm that, given k, the exact number of elements that occur an odd number of times, reports all of them in expected time O(n k) (read: linear-time when k is O(1)) and space O(1) words, assuming that "give me a uniform random word" and "give me the number of 1 bits set in this word (popcount)" are constant-time operations. I'm pretty sure that I'm not the first person to come up with this algorithm (and I'm not even sure that I'm remembering all of the refinements), but I've reached the limits of my patience trying to find it.
The central technique is called random restrictions. Essentially what we do is to filter the input randomly by value, in the hope that we retain exactly one odd-count element. We apply the classic XOR algorithm to the filtered array and check the result; if it succeeded, then we pretend to add it to the array, to make it even-count. Repeat until all k elements are found.
The filtration process goes like this. Treat each input word x as a binary vector of length w (doesn't matter what w is). Compute a random binary matrix A of size ceil(1 + lg k) by w and a random binary vector b of length ceil(1 + lg k). We filter the input by retaining those x such that Ax = b, where the left-hand side is a matrix multiplication mod 2. In implementation, A is represented as ceil(1 + lg k) row vectors a1, a2, .... We compute the bits of Ax as popcount(a1 & x) mod 2, popcount(a2 & x) mod 2, .... (This is convenient because we can short-circuit the comparison with b, which shaves a factor lg k from the running time.)
The analysis is to show that, in a given pass, we manage with constant probability to single out one of the odd-count elements. First note that, for some fixed x, the probability that Ax = b is 2^(-ceil(1 + lg k)) = Θ(1/k). Given that Ax = b, for each y ≠ x, the probability that Ay = b is 2^(-ceil(1 + lg k)), which is at most 1/(2k). Thus, the expected number of other odd-count elements that accompany x is less than 1/2, so with probability more than 1/2, x is the only odd-count element in the filtered input. Sum over all k odd-count elements (these events are disjoint), and the probability is Θ(1).
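Here is a rough Python sketch of the random-restriction loop, written under a few assumptions that are not in the answer: inputs are non-negative integers that fit in word_bits bits, k is exactly the number of odd-count values, and a candidate is verified by recounting it in the input (one extra O(n) pass). The names find_odd_elements, parity and keep are made up for this example.

```python
import math
import random


def parity(x):
    """Parity of the number of 1 bits in a non-negative integer."""
    return bin(x).count("1") & 1


def find_odd_elements(values, k, word_bits=32, rng=random):
    """Las Vegas search for the k values that occur an odd number of times."""
    rows = math.ceil(1 + math.log2(k))
    found = []
    while len(found) < k:
        # Random restriction: keep x only when A x = b (mod 2).
        a = [rng.getrandbits(word_bits) for _ in range(rows)]
        b = [rng.getrandbits(1) for _ in range(rows)]

        def keep(x):
            # all() stops at the first mismatching row (short-circuit).
            return all(parity(ai & x) == bi for ai, bi in zip(a, b))

        acc = 0
        for x in values:
            if keep(x):
                acc ^= x
        # "Pretend to add" the values already found so they cancel.
        for x in found:
            if keep(x):
                acc ^= x

        # Assumed verification step: accept acc only if it really has an
        # odd count in the input and has not been reported yet.
        if acc not in found and sum(1 for x in values if x == acc) % 2 == 1:
            found.append(acc)
    return found


if __name__ == "__main__":
    print(sorted(find_odd_elements([5, 1, 9, 1, 5, 5, 12, 9, 7, 7, 30], 3)))
```

Because every candidate is verified before it is accepted, the result is always correct; only the number of passes is random. The found list holds at most k values, so the extra space stays O(k) words.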
Here's a deterministic linear-time algorithm for k = 3. Let the odd-count elements be a, b, c. Accumulate the XOR of the array, which is s = a ^ b ^ c. For each bit i, observe that, if a[i] == b[i] == c[i], then s[i] == a[i] == b[i] == c[i]. Make another pass through the array and accumulate the XOR of the lowest bit set in s ^ x for each element x. The even-count elements contribute nothing again. Two of the odd-count elements contribute the same bit and cancel each other out. Thus, the XOR is a single bit: the lowest bit at which the third odd-count element differs from s. At that bit, exactly one of the three odd-count elements agrees with s, so restricting the input to the elements that agree with s there singles it out, as in the restriction method above. Then use the k = 2 method to find the others.
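A small Python sketch of the k = 3 procedure follows. It assumes non-negative integers and exactly three distinct odd-count values. In this sketch the second pass leaves a single bit t at which exactly one of the three values agrees with s, so the third pass keeps the elements that agree with s at that bit. The name find_three_odd is hypothetical.

```python
def find_three_odd(values):
    """Return the three values that occur an odd number of times."""
    # Pass 1: s = a ^ b ^ c.
    s = 0
    for x in values:
        s ^= x

    # Pass 2: XOR of the lowest set bit of (s ^ x); d & -d is 0 when d is 0.
    t = 0
    for x in values:
        d = s ^ x
        t ^= d & -d

    # Pass 3: exactly one odd-count value agrees with s at bit t.
    first = 0
    for x in values:
        if ((x ^ s) & t) == 0:
            first ^= x

    # k = 2 method on the remaining pair, with `first` treated as if it
    # had been appended once more so that it cancels.
    rest = s ^ first
    low = rest & -rest
    second = first if first & low else 0
    for x in values:
        if x & low:
            second ^= x
    return sorted([first, second, rest ^ second])


if __name__ == "__main__":
    print(find_three_odd([1, 2, 4, 9, 9]))
```

The sketch makes four passes over the input and keeps a handful of integer accumulators, so it stays linear in time and constant in space.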
The question title says "the unique integer", but the question body says there can be more than one unique element.
If there is in fact only one non-duplicate: XOR all the elements together. The duplicates all cancel, because they come in pairs (or higher multiples of 2), so the result is the unique integer.
See Dante's answer for an extension of this idea that can handle two unique elements. It can't be generalized to more than that.
Perhaps for k unique elements, we could use k accumulators to track sum(a[i]**j) for j = 1..k, i.e. a[i], a[i]^2, etc. This probably only works for the related question "Faster algorithm to find unique element between two arrays?", not this case where the duplicates are all in one array. IDK if an XOR of squares, cubes, etc. would be any use for resolving things.
Track the counts for each element and only return the elements with a count of 1. This can be done with a hash map. The example below tracks the result in a hash set while it is still building the counts map. It is still O(n), if somewhat less efficient, but I think it's slightly more instructive.
Javascript with jsfiddle http://jsfiddle.net/nmckchsa/
function findUnique(arr) {
    var uniq = new Map();
    var result = new Set();
    // iterate through array
    for (var i = 0; i < arr.length; i++) {
        var v = arr[i];
        // add value to map that contains counts
        if (uniq.has(v)) {
            uniq.set(v, uniq.get(v) + 1);
            // count is greater than 1, remove from set
            result.delete(v);
        } else {
            uniq.set(v, 1);
            // add a possibly unique value to the set
            result.add(v);
        }
    }
    // set to array O(n)
    var a = [], x = 0;
    result.forEach(function(v) { a[x++] = v; });
    return a;
}
alert(findUnique([1,2,3,0,1,2,3,1,2,3,5,4,4]));
EDIT: Since the non-unique numbers appear an even number of times, @PeterCordes suggested a more elegant set toggle.
Here's how that would look.
function findUnique(arr) {
    var result = new Set();
    // iterate through array
    for (var i = 0; i < arr.length; i++) {
        var v = arr[i];
        if (result.has(v)) { // even occurrences
            result.delete(v);
        } else { // odd occurrences
            result.add(v);
        }
    }
    // set to array O(n)
    var a = [], x = 0;
    result.forEach(function(v) { a[x++] = v; });
    return a;
}
JSFiddle http://jsfiddle.net/hepsyqyw/
Assuming you have an input array: [2,3,4,2,4]
Output: 3
In Ruby, you can do something as simple as this:
[2,3,4,2,4].inject(0) {|xor, v| xor ^ v}
1. Create an array counts that has INT_MAX slots, with each element initialized to zero.
2. For each element in the input list, increment counts[element] by one. (edit: actually, you will need to do counts[element] = (counts[element]+1)%2, or else you might overflow the value for really ridiculously large values of N. It's acceptable to do this kind of modulus counting because all duplicate items appear an even number of times)
3. Iterate through counts until you find a slot that contains "1". Return the index of that slot.
Step 2 is O(N) time. Steps 1 and 3 take up a lot of memory and a lot of time, but neither one is proportional to the size of the input list, so they're still technically O(1).
(note: this assumes that integers have a minimum and maximum value, as is the case for many programming languages.)

Need idea for solving this algorithm puzzle

I've come across some problems similar to this one in the past, and I still don't have a good idea of how to solve it. The problem goes like this:
You are given an array of n <= 1000 positive integers and a number k <= n, which is the number of contiguous subarrays that you have to split the array into. You have to output the minimum m, where m = max{s[1],..., s[k]} and s[i] is the sum of the i-th subarray. All integers in the array are between 1 and 100. Example:
Input:
5 3 >> n = 5 k = 3
2 1 1 2 3
Output:
3
Splitting the array into 2+1 | 1+2 | 3 minimizes m.
My brute force idea was to make the first subarray end at position i (for all possible i) and then try to split the rest of the array into k-1 subarrays in the best way possible. However, this is an exponential solution and will never work.
So I'm looking for good ideas to solve it. If you have one, please tell me.
Thanks for your help.
You can use dynamic programming to solve this problem, but you can also solve it with a greedy check plus binary search on the answer. This algorithm's complexity is O(n log d), where d is the output answer. (An upper bound on d is the sum of all the elements in the array.) Put differently, it is O(n b), where b is the number of bits in the output.
The idea is to binary search on what your m would be, and then greedily move forward through the array, adding the current element to the current partition unless doing so pushes its sum over the current m; in that case you start a new partition. The current m is a success (so you lower your upper bound to it) if the number of partitions used is less than or equal to the given k. Otherwise, you used too many partitions, and you raise your lower bound on m.
Some pseudocode:
// binary search on the answer
binary_search( array, N, k ) {
    lower = max( array ), upper = sum( array )
    while lower < upper {
        mid = ( lower + upper ) / 2
        // if the greedy fits into k partitions, mid is feasible
        if partitions( array, mid ) <= k
            upper = mid
        else
            lower = mid + 1   // plain "lower = mid" would loop forever
    }
    return lower
}
// number of partitions the greedy needs when no partition may exceed m
partitions( array, m ) {
    count = 0
    running_sum = 0
    for x in array {
        if running_sum + x > m {
            running_sum = 0
            count++
        }
        running_sum += x
    }
    if running_sum > 0
        count++
    return count
}
This should be easier to come up with conceptually. Also note that because the partitions function is monotonic in m, you can skip the binary search and do a linear search if you are sure that the output d is not too big:
// start at max( array ): no partition can hold an element larger than m
for i = max( array ) to infinity
    if partitions( array, i ) <= k
        return i
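For reference, the greedy check and the binary search on the answer can be sketched in Python as below. The function names partitions_needed and min_largest_sum are made up for this example, and the sketch assumes a non-empty array of positive integers with 1 <= k <= n.

```python
def partitions_needed(array, m):
    """Greedy count of contiguous parts when no part may sum above m."""
    # Assumes m >= max(array), so every element fits into some part.
    count = 1
    running_sum = 0
    for x in array:
        if running_sum + x > m:
            count += 1
            running_sum = 0
        running_sum += x
    return count


def min_largest_sum(array, k):
    """Smallest m such that the array splits into at most k parts."""
    lower, upper = max(array), sum(array)
    while lower < upper:
        mid = (lower + upper) // 2
        if partitions_needed(array, mid) <= k:
            upper = mid          # mid is feasible, try smaller
        else:
            lower = mid + 1      # mid is too small
    return lower


if __name__ == "__main__":
    print(min_largest_sum([2, 1, 1, 2, 3], 3))
```

Using at most k parts is enough here: since all elements are positive and k <= n, a split into fewer parts can always be cut further without raising the maximum.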
Dynamic programming. Make an array
int best[k+1][n+1];
where best[i][j] is the best you can achieve splitting the first j elements of the array into i subarrays. best[1][j] is simply the sum of the first j array elements. Having row i, you calculate row i+1 as follows:
for(j = i+1; j <= n; ++j){
    /* the last subarray is h+1 .. j; its cost is the larger of the two parts */
    temp = max(best[i][i], arraysum[i+1 .. j]);
    for(h = i+1; h < j; ++h){
        if (max(best[i][h], arraysum[h+1 .. j]) < temp){
            temp = max(best[i][h], arraysum[h+1 .. j]);
        }
    }
    best[i+1][j] = temp;
}
best[k][n] will contain the solution. The algorithm is O(n^2*k); probably something better is possible.
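A compact Python sketch of this table follows. It uses prefix sums for arraysum, and takes the minimum over split points of the maximum of the two parts. The name min_largest_sum_dp is hypothetical, and the sketch assumes 1 <= k <= n.

```python
def min_largest_sum_dp(array, k):
    """best[i][j]: minimal largest sum when the first j items form i parts."""
    n = len(array)
    prefix = [0] * (n + 1)
    for idx, x in enumerate(array):
        prefix[idx + 1] = prefix[idx] + x

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    for j in range(1, n + 1):
        best[1][j] = prefix[j]

    for i in range(1, k):
        for j in range(i + 1, n + 1):
            # The last part covers items h+1 .. j; the first h items form i parts.
            best[i + 1][j] = min(
                max(best[i][h], prefix[j] - prefix[h]) for h in range(i, j)
            )
    return best[k][n]


if __name__ == "__main__":
    print(min_largest_sum_dp([2, 1, 1, 2, 3], 3))
```

With n <= 1000 the O(n^2*k) table can still be slow in the worst case for large k, which is one reason the binary-search approach is attractive.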
Edit: a combination of the ideas of ChingPing, toto2, Coffee on Mars and rds (in the order they appear as I currently see this page).
Set A = ceiling(sum/k). This is a lower bound for the minimum. To find a good upper bound for the minimum, create a good partition by any of the mentioned methods, moving borders until you don't find any simple move that still decreases the maximum subsum. That gives you an upper bound B, not much larger than the lower bound (if it were much larger, you'd find an easy improvement by moving a border, I think).
Now proceed with ChingPing's algorithm, with the known upper bound reducing the number of possible branches. This last phase is O((B-A)*n), finding B unknown, but I guess better than O(n^2).
I have a sucky branch and bound algorithm (please don't downvote me).
First take the sum of the array and divide it by k, which gives you the best-case bound for your answer, i.e. the average A. We also keep the best solution seen so far over all branches, GO (global optimal). Let's say we put a (logical) divider after some array element to mark a partition boundary, and we have to place k-1 dividers. Now we place the dividers greedily this way:
Traverse the array elements, summing them up until you see that at the next position the sum would exceed A. Now make two branches, one where you put the divider at this position and another where you put it at the next position. Do this recursively and set GO = min(GO, answer for a branch).
If at any point in any branch we have a partition greater than GO, or the number of positions left is less than the number of dividers left to be placed, we bound. In the end you should have GO as your answer.
EDIT:
As suggested by Daniel, we could modify the divider placing strategy a little: keep going until the sum of elements reaches A or the remaining positions are fewer than the remaining dividers.
This is just a sketch of an idea... I'm not sure that it works, but it's very easy (and probably fast too).
You start say by putting the separations evenly distributed (it does not actually matter how you start).
Make the sum of each subarray.
Find the subarray with the largest sum.
Look at the right and left neighbor subarrays and move the separation on the left by one if the subarray on the left has a lower sum than the one on the right (and vice-versa).
Redo for the subarray with the current largest sum.
You'll reach some situation where you'll keep bouncing the separation between the same two positions which will probably mean that you have the solution.
EDIT: see the comment by @rds. You'll have to think harder about bouncing solutions and the end condition.
My idea, which unfortunately does not work:
1. Split the array into N subarrays.
2. Locate the two contiguous subarrays whose sum is the least.
3. Merge the subarrays found in step 2 to form a new contiguous subarray.
4. If the total number of subarrays is greater than k, iterate from step 2, else finish.
If your array has random numbers, you can hope that a partition where each subarray has n/k elements is a good starting point.
From there:
1. Evaluate this candidate solution by computing the sums.
2. Store this candidate solution, for instance with:
   an array of the indexes of every sub-array
   the corresponding maximum of the sums over sub-arrays
3. Reduce the size of the max sub-array: create two new candidates, one with the sub-array starting at index+1 and one with the sub-array ending at index-1.
4. Evaluate the new candidates.
   If their maximum is higher, discard them.
   If their maximum is lower, iterate from step 2, unless this candidate was already evaluated, in which case it is the solution.

Finding the maximum subsequence binary sets that have an equal number of 1s and 0s

I found the following problem on the internet, and would like to know how I would go about solving it:
You are given an array containing 0s and 1s. Find an O(n) time and O(1) space algorithm to find the maximum subsequence which has an equal number of 1s and 0s.
Examples:
10101010 -
The longest sub sequence that satisfies the problem is the input itself
1101000 -
The longest sub sequence that satisfies the problem is 110100
Update.
I have to completely rephrase my answer. (If you had upvoted the earlier version, well, you were tricked!)
Let's sum up the easy case again, to get it out of the way:
Find the longest prefix of the bit-string containing an equal number of 1s and 0s of the array.
This is trivial: A simple counter is needed, counting how many more 1s we have than 0s, and iterating the bitstring while maintaining this. The position where this counter becomes zero for the last time is the end of the longest sought prefix. O(N) time, O(1) space. (I'm completely convinced by now that this is what the original problem asked for. )
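The counter idea for the prefix version can be sketched in a few lines of Python. The name longest_balanced_prefix is made up for this example, and the input is assumed to be a sequence of 0/1 integers.

```python
def longest_balanced_prefix(bits):
    """Length of the longest prefix with as many 1s as 0s."""
    balance = 0   # (number of 1s) - (number of 0s) seen so far
    best = 0
    for position, bit in enumerate(bits, start=1):
        balance += 1 if bit else -1
        if balance == 0:
            best = position   # remember the last time the counter hit zero
    return best


if __name__ == "__main__":
    print(longest_balanced_prefix([1, 1, 0, 1, 0, 0, 0]))
```

Only the running balance and the best position are stored, so the extra space is constant apart from the counters themselves.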
Now let's switch to the more difficult version of the problem: we no longer require subsequences to be prefixes; they can start anywhere.
After some back and forth thought, I thought there might be no linear algorithm for this. For example, consider the prefix "111111111111111111...". Every single 1 of those may be the start of the longest subsequence, there is no candidate subsequence start position that dominates (i.e. always gives better solutions than) any other position, so we can't throw away any of them (O(N) space) and at any step, we must be able to select the best start (which has an equal number of 1s and 0s to the current position) out of linearly many candidates, in O(1) time. It turns out this is doable, and easily doable too, since we can select the candidate based on the running sum of 1s (+1) and 0s (-1), this has at most size N, and we can store the first position we reach each sum in 2N cells - see pmod's answer below (yellowfog's comments and geometric insight too).
Failing to spot this trick, I had replaced a fast but wrong with a slow but sure algorithm, (since correct algorithms are preferable to wrong ones!):
Build an array A with the accumulated number of 1s from the start to that position, e.g. if the bitstring is "001001001", then the array would be [0, 0, 1, 1, 1, 2, 2, 2, 3] (positions are 1-based, and A[0] is taken to be 0). Using this, we can test in O(1) whether the subsequence (i,j), inclusive, is valid: isValid(i, j) = (j - i + 1 == 2 * (A[j] - A[i - 1])), i.e. it is valid if its length is double the number of 1s in it. For example, the subsequence (3,6) is valid because 6 - 3 + 1 == 2 * (A[6] - A[2]) = 4.
Plain old double loop:
maxSubsLength = 0
for i = 1 to N - 1
    for j = i + 1 to N
        if isValid(i, j) ... #maintain maxSubsLength
    end
end
This can be sped up a bit using some branch-and-bound by skipping i/j sequences which are shorter than the current maxSubsLength, but asymptotically this is still O(n^2). Slow, but with a big plus on its side: correct!
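The prefix-count test and the double loop can be sketched in Python as follows. The name longest_balanced_quadratic and the returned (start, length) pair are choices made for this example; positions are 1-based to match the text.

```python
def longest_balanced_quadratic(bits):
    """Return (start, length), 1-based, of a longest balanced run; O(n^2)."""
    n = len(bits)
    ones = [0] * (n + 1)          # ones[p] = number of 1s in positions 1..p
    for p, bit in enumerate(bits, start=1):
        ones[p] = ones[p - 1] + bit

    best_start, best_len = 0, 0
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            length = j - i + 1
            # Valid when the run is exactly twice as long as its count of 1s.
            if length > best_len and length == 2 * (ones[j] - ones[i - 1]):
                best_start, best_len = i, length
    return best_start, best_len


if __name__ == "__main__":
    print(longest_balanced_quadratic([0, 0, 1, 0, 0, 1, 0, 0, 1]))
```

For the example bitstring "001001001" the first longest run found starts at position 3 and has length 4, which is the (3,6) case checked in the text.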
Strictly speaking, the answer is that no such algorithm exists because the language of strings consisting of an equal number of zeros and ones is not regular.
Of course everyone ignores that fact that storing an integer of magnitude n is O(log n) in space and treats it as O(1) in space. :-) Pretty much all big-O's, including time ones, are full of (or rather empty of) missing log n factors, or equivalently, they assume n is bounded by the size of a machine word, which means you're really looking at a finite problem and everything is O(1).
New solution:
Suppose that for an n-bit input bit-array we have an array of size 2*n in which to keep bit positions. Each array element must be wide enough to hold the maximum position number. For a 256-bit input array, a 256x2 array of bytes is needed (a byte is enough to hold 255, the maximum position).
Moving from the first position of the bit-array, we store positions into the array, starting from the middle of the array (index n), using these rules:
1. Increment the index when we pass a "1" bit and decrement it when we pass a "0" bit.
2. When we meet an already initialized array element, don't change it; take the difference between positions (current minus the one stored in the array element). This is the size of a local maximum sequence.
3. Every time we meet a local maximum, compare it with the global maximum and update the global maximum if it is smaller.
For example: bit sequence is 0,0,0,1,0,1
initial array index is n
set arr[n] = 0 (position)
bit 0 -> index--
set arr[n-1] = 1
bit 0 -> index--
set arr[n-2] = 2
bit 0 -> index--
set arr[n-3] = 3
bit 1 -> index++
arr[n-2] already contains 2 -> thus, local max seq is [3,2] becomes abs. maximum
will not overwrite arr[n-2]
bit 0 -> index--
arr[n-3] already contains 3 -> thus, local max seq is [4,3] is not abs. maximum
bit 1 -> index++
arr[n-2] already contains 2 -> thus, local max seq is [5,2] is abs. max
Thus, we pass through the whole bit array only once.
Does this solve the task?
input:
n - number of bits
a[n] - input bit-array
track_pos[2*n+1] = {0,};
ind = n;
glob_seq_size = 0; glob_pos_from = 0; glob_pos_to = 0;
/* start from position 1 since zero has
   meaning track_pos[x] is not initialized;
   the extra iteration i == n+1 checks the balance after the last bit */
for (i = 1; i <= n+1; i++) {
    if (track_pos[ind]) {
        seq_size = i - track_pos[ind];
        if (glob_seq_size < seq_size) {
            /* store as interm. result */
            glob_seq_size = seq_size;
            glob_pos_from = track_pos[ind];
            glob_pos_to = i - 1;
        }
    } else {
        track_pos[ind] = i;
    }
    if (i <= n) {
        if (a[i-1])
            ind++;
        else
            ind--;
    }
}
output:
glob_seq_size - length of maximum sequence
glob_pos_from - start position of max sequence (1-based)
glob_pos_to - end position of max sequence (1-based, inclusive)
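The same first-occurrence idea can be sketched in Python. This version uses 0-based positions and returns a (start, length) pair; those conventions and the name longest_balanced are choices made for this example rather than part of the answer. It uses O(n) extra space for the table, like the code above.

```python
def longest_balanced(bits):
    """Return (start, length), 0-based, of a longest run with equal 0s and 1s."""
    n = len(bits)
    # first_seen[b] = number of bits consumed when balance b first occurred.
    first_seen = [-1] * (2 * n + 1)
    balance = n                   # offset so the index is never negative
    first_seen[balance] = 0

    best_start, best_len = 0, 0
    for consumed, bit in enumerate(bits, start=1):
        balance += 1 if bit else -1
        if first_seen[balance] >= 0:
            # Same balance seen before: the bits in between cancel out.
            length = consumed - first_seen[balance]
            if length > best_len:
                best_start, best_len = first_seen[balance], length
        else:
            first_seen[balance] = consumed
    return best_start, best_len


if __name__ == "__main__":
    print(longest_balanced([0, 0, 0, 1, 0, 1]))
```

For the trace 0,0,0,1,0,1 from the text this reports the run covering the last four bits, which matches the final "abs. max" step of the example.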
In this thread ( http://discuss.techinterview.org/default.asp?interview.11.792102.31 ), poster A.F. has given an algorithm that runs in O(n) time and uses O(sqrt(n log n)) bits.
Brute force: start with the maximum length of the array and count the 0's and 1's. If the number of 0's equals the number of 1's, you are finished. Otherwise reduce the search length by 1 and repeat the count for all subsequences of the reduced length, and so on. Stop when the length reaches 0.
As was pointed out by user "R..", there is no solution, strictly speaking, unless you ignore the "log n" space complexity. In the following, I will consider that the array length fits in a machine register (e.g. a 64-bit word) and that a machine register has size O(1).
The important point to notice is that if there are more 1's than 0's, then the maximum subsequence that you are looking for necessarily includes all the 0's, and that many 1's. So here is the algorithm:
Notations: the array has length n, indices are counted from 0 to n-1.
First pass: count the number of 1's (c1) and 0's (c0). If c1 = c0 then your maximal subsequence is the entire array (end of algorithm). Otherwise, let d be the digit which appears less often (d = 0 if c0 < c1, otherwise d = 1).
Compute m = min(c0, c1) * 2. This is the size of the subsequence you are looking for.
Second pass: scan the array to find the index j of the first occurrence of d.
Compute k = max(j, n - m). The subsequence starts at index k and has length m.
Note that there could be several solutions (several subsequences of maximal length which match the criterion).
In plain words: assuming that there are more 1's than 0's, then I consider the smallest subsequence which contains all the 0's. By definition, that subsequence is surrounded by bunches of 1's. So I just grab enough 1's from the sides.
Edit: as was pointed out, this does not work... The "important point" is actually wrong.
Try something like this:
/* bit(n) is a macro that returns the nth bit, 0 or 1. len is number of bits */
int c[2] = {0,0};
int d, i, a, b, p;
for(i=0; i<len; i++) c[bit(i)]++;
d = c[1] < c[0];
if (c[d] == 0) return; /* all bits identical; fail */
for(i=0; bit(i)!=d; i++);
a = b = i;
for(p=0; i<len; i++) {
    p += 2*bit(i)-1;
    if (!p) b = i;
}
if (a == b) { /* account for case where we need bits before the first d */
    b = len - 1;
    a -= abs(p);
}
printf("maximal subsequence consists of bits %d through %d\n", a, b);
Completely untested but modulo stupid mistakes it should work. Based on my reply to Thomas's answer which failed in certain cases.
New Solution:
Space complexity of O(1) and time complexity O(n^2)
int iStart = 0, iEnd = 0;
int[] arrInput = { 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0 };
for (int i = 0; i < arrInput.Length; i++)
{
    int iCurrEndIndex = i;
    int iSum = 0;
    for (int j = i; j < arrInput.Length; j++)
    {
        iSum = (arrInput[j] == 1) ? iSum + 1 : iSum - 1;
        if (iSum == 0)
        {
            iCurrEndIndex = j;
        }
    }
    if ((iEnd - iStart) < (iCurrEndIndex - i))
    {
        iEnd = iCurrEndIndex;
        iStart = i;
    }
}
I am not sure whether the array you are referring to is an int array of 0's and 1's or a bit array.
If it's a bit array, here is my approach:
int isEvenBitCount(int n)
{
    //n ... //Decimal equivalent of the input binary sequence
    int cnt1 = 0, cnt0 = 0;
    while(n){
        if(n&0x01) { printf("1 "); cnt1++;}
        else { printf("0 "); cnt0++; }
        n = n>>1;
    }
    printf("\n");
    return cnt0 == cnt1;
}
int main()
{
    int i = 40, j = 25, k = 35;
    isEvenBitCount(i)?printf("-->Yes\n"):printf("-->No\n");
    isEvenBitCount(j)?printf("-->Yes\n"):printf("-->No\n");
    isEvenBitCount(k)?printf("-->Yes\n"):printf("-->No\n");
}
With the use of bitwise operations, the time complexity is almost O(1) as well.

Resources