binary search nearest match with last occurrence - c

I am implementing an efficient algorithm to search for the last occurrence of a key, or of its nearest match above (upper bound).
So far, I have this:
long bin_search_closest_match_last_occurance ( long * lArray, long sizeArray, long lnumber)
{
    long left, right, mid, last_occur;
    left = 0;
    right = sizeArray - 1;
    last_occur = -1;
    while ( left <= right )
    {
        mid = ( left + right ) / 2;
        if ( lArray[mid] == lnumber )
        {
            last_occur = mid;
            left = mid + 1;
        }
        if ( lArray[mid] > lnumber )
            right = mid - 1;
        else
            left = mid + 1;
    }
    return last_occur != -1 ? last_occur : mid;
}
Let's take the array {0,0,1,5,9,9,9,9} and the key 6.
The function should return index 7, but mine returns 4.
Please note that I do not want to iterate linearly to the last matching index.
The solution I have in mind is to change the function's parameters (add start/end indexes) and, only if I don't find an exact match (last_occur == -1), do another binary search within the function, from the found upper bound to the end of the array.
I want to ask: is there a better/cleaner way to implement it?

n.m.'s 2-search approach will work, and it keeps the optimal time complexity, but it's likely to increase the constant factor by around 2, or by around 1.5 if you begin the second search from where the first search ended.
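For concreteness, here is a rough sketch of that two-search variant (my own illustration, not n.m.'s exact code; the helper name and the -1 "not found" convention are assumptions). The first search finds the first element greater than the key; the second starts where the first one ended:
long upper_bound_idx(const long *a, long lo, long hi, long v)
{
    /* first index in [lo, hi] with a[i] > v, or hi + 1 if there is none */
    while (lo <= hi)
    {
        long mid = lo + (hi - lo) / 2;
        if (a[mid] > v)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return lo;
}

long last_occurrence_or_upper(const long *a, long n, long key)
{
    long u = upper_bound_idx(a, 0, n - 1, key);   /* first element > key */
    if (u > 0 && a[u - 1] == key)
        return u - 1;                             /* last occurrence of the key */
    if (u == n)
        return -1;                                /* everything is <= key: no upper bound */
    /* second search, restricted to [u, n-1], for the last occurrence of the upper-bound value */
    return upper_bound_idx(a, u, n - 1, a[u]) - 1;
}
For the asker's example, the first search lands on the first 9 (index 4) and the second search over [4, 7] returns index 7.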
If instead you take an "ordinary" binary search that finds the first instance of lnumber (or, if it doesn't exist, a lower bound), and change it so that the algorithm logically "reverses" the array by changing every array access lArray[x] to lArray[sizeArray - 1 - x] (for any expression x), and also "reverse" the ordering by changing the > lnumber test to < lnumber, then only a single binary search is needed.

The only array accesses this algorithm actually performs are two lookups to lArray[mid], which an optimising compiler is very likely to evaluate only once if it can prove that nothing will change the value in between the accesses (this might require adding restrict to the declaration of long * lArray; alternatively, you could just load the element into a local variable and test it twice instead).

Either way, if only a single array lookup per iteration is needed, then changing the index from mid to sizeArray - 1 - mid will add just 2 extra subtractions per iteration (or just 1 if you --sizeArray before entering the loop), which I expect will not increase the constant nearly as much as n.m.'s approach. Of course, as with anything, if performance is critical then test it; and if it's not, then don't worry too much about saving microseconds.
You will also need to "reverse" the return value too:
return last_occur!=-1?last_occur:sizeArray - 1 - mid;

Related

Time complexity of finding index equal to array (sorted) value

This kind of recursion is similar to a binary search, but I'm not sure how to exactly solve the recursion using back substitution.
To find the index where it is equal to the array value (in a sorted array), the code would basically look like this:
find(array, low, high) {
    if high < low
        return -1
    mid = (low + high) / 2
    midval = array[mid]
    if midval == mid
        return mid
    int left = find(array, low, mid - 1)
    if left >= 0
        return left
    int right = find(array, mid + 1, high)
    return right
}
So the recurrence relation would look like this:
T(1) = b
T(n) = 2T(n/2) + c
     = 4T(n/4) + c(1+2)
     = 8T(n/8) + c(1+2+4)
     = 16T(n/16) + c(1+2+4+8)
     = 2^k T(n/2^k) + (2^k - 1)c
     = 2^(log n) T(1) + (2^(log n) - 1)c
     = 2^(log n) (b + c) - c
I know the time complexity is supposed to be something like O(log n) or O(n log n), but I'm not sure how to get there from this using back substitution.
With a sorted array, finding an element with a naive implementation is at worst O(n). Hence, a better approach should have a worst-case complexity lower than O(n), so it cannot be O(n log n).
In a typical binary search, one takes advantage of the array being sorted, and therefore one does not need to search both halves on each recursive call. One goes either left or right in the array. So instead of T(n) = 2T(n/2) + c one would have T(n) = T(n/2) + c.
Now your problem is different from a binary search, because you want to find a position in the array whose value matches its index. So, unlike a binary search, in this context you may have to go both left and right in some recursive calls.
So in your case the worst-case scenario is actually O(n), since 2^(log2 n) is n. Unless there is a super clever way of improving your code, I would just go for a normal search: simpler and more readable code, with the same O(n) worst case.
You search from the beginning of the array; if the value x matches the index, you return that value. Otherwise, if x is greater than the current index, you can jump straight to the index equal to the value x (i.e., array[x]), skipping array positions that, given that the array is sorted, cannot have an index matching their value. A sketch of this scan follows.
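A minimal sketch of that scan (my illustration in C, assuming an ascending sorted int array):
int find_fixed_point(const int *a, int n)
{
    int i = 0;
    while (i < n)
    {
        if (a[i] == i)
            return i;        /* index matches its value */
        else if (a[i] > i)
            i = a[i];        /* sorted: for i < j < a[i], a[j] >= a[i] > j, so no match there */
        else
            i++;             /* a[i] < i: just advance */
    }
    return -1;
}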

Magic Array Index Time/Space Complexity

I've been looking at the following problem:
Magic Index: A magic index in an array A[0...n-1] is defined to be an index i such that A[i] = i. Given a sorted non-distinct array of integers, write a method to find a magic index if one exists.
Here is my solution:
static int magicNonDistinct(int[] array, int start, int end) {
    if (end < start) return -1;
    int mid = start + (end - start) / 2;
    if (mid < 0 || mid >= array.length) return -1;
    int v = array[mid];
    if (v == mid) return mid;
    int leftEnd = Math.min(v, mid - 1);
    int leftRes = magicNonDistinct(array, start, leftEnd);
    if (leftRes != -1) return leftRes;
    int rightStart = Math.max(v, mid + 1);
    int rightRes = magicNonDistinct(array, rightStart, end);
    return rightRes;
}
It works just fine and is the recommended solution from the book Cracking The Code Interview 6th Edition, problem 8.3 Follow up (sorry for spoiling).
However when running this on a distinct array with no magic index, it visits all the elements, yielding a worst case running time of O(n).
Since it is recursive it takes O(n) memory as worst case.
Why would this solution be preferable to just iterating over the array? This solution (my own) is better I would argue:
static int magicNonDistinctV2(int[] array) {
    for (int i = 0; i < array.length; ++i) {
        int v = array[i];
        if (v == i) return v;
        if (v >= array.length) return -1;
        else if (v > i) i = v - 1;
    }
    return -1;
}
O(n) running time O(1) space always?
Could somebody derive a better time complexity for the initial algorithm? I've been thinking about whether it is O(d), where d is the number of distinct elements, but that bound is also wrong, since the min/max skip only works in one direction (think about what happens if v = 5, mid = 4 and the lower part of the array is all fives).
EDIT:
OK, people think I'm bananas and scream O(log(n)) as soon as they see something that looks like binary search. Sorry for being unclear, folks.
Let's talk about the code in the first posting I made (the solution from CTCI):
If we have an array looking like this: [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8], i.e. an array of the form [-1, ..., n-2] of size n, we know that no element can match its index. However, the algorithm will visit all elements, since it cannot assume the elements are distinct. I dare you, run it: it cannot divide the search space by 2 as in a regular binary search. Please tell me what is wrong with my reasoning.
No, in my opinion the first solution is not O(log n) as other answers state; it is really O(n) worst case (in the worst case it still needs to go through all the elements; consider an array shifted by one, as also mentioned by the author).
The cause why it is not O(log n) is that it needs to search on both sides of the middle (a binary search only checks one side of the middle, which is why it is O(log n)).
It allows skipping items if you're lucky; however, your second iterative solution skips items too when there is no need to look at them (because you know there cannot be a magic index in such a range, as the array is sorted), so in my opinion the second solution is better (the same time complexity, plus it is iterative, i.e. better space complexity and no recursive calls, which are relatively expensive).
EDIT: However, when I thought about the first solution again, I realized it on the other hand also allows "skipping backwards" where possible, which the iterative solution does not. Consider for example an array like { -10, -9, -8, -7, -6, -5 }: the iterative solution would need to check all the elements, because it starts at the beginning and the values never allow it to skip forward, whereas starting from the middle, the algorithm can completely skip checking the first half, then the first half of the second half, etc.
You are correct, the worst case complexity is O(n). You may have to visit all the elements of your array.
There is only one reason to not visit the array elements [mid, end] and that is when array[mid] > end (because in that case, the magic index is surely absent from [mid, end] elements).
Similarly, there is only one reason to not visit the array elements [start, mid] and that is when array[start] > mid.
So there is hope that you may not have to visit all the elements; it is an optimization that can help.
Thus, this binary-search-like method can beat iterating over the entire array linearly, but in the worst case you will still hit O(n).
PS: I've assumed that array is sorted in ascending order.
It looks like you misunderstood the time complexity of the required solution. The worst case is not O(n), it is O(log(n)). This is because during each pass you search only half of the array the next time.
Here is a C++ example; you can check that for a whole array of 11 elements, it takes only 3 checks.

Efficiently choose an integer distinct from all elements of a list

I have a linked list of objects each containing a 32-bit integer (and provably fewer than 2^32 such objects) and I want to efficiently choose an integer that's not present in the list, without using any additional storage (so copying them to an array, sorting the array, and choosing the minimum value not in the array would not be an option). However, the definition of the structure for list elements is under my control, so I could add (within reason) additional storage to each element as part of solving the problem. For example, I could add an extra set of prev/next pointers and merge-sort the list. Is this the best solution? Or is there a simpler or more efficient way to do it?
Given the conditions that you outline in the comments, especially your expectation of many identical values, you must expect a sparse distribution of used values.
Consequently, it might actually be best to just guess a value randomly and then check whether it coincides with a value in the list. Even if half the available value range were used (which seems extremely unlikely from your comments), you would only traverse the list twice on average. And you can drastically decrease this factor by simultaneously checking a number of guesses in one pass. Done correctly, the factor should always be close to one.
The advantage of such a probabilistic approach is that you are immune to bad sequences of values. Such sequences are always possible with range based approaches: If you calculate the min and max of the data, you run the risk, that the data contains both 0 and 2^32-1. If you sequentially subdivide an interval, you run the risk of always getting values in the middle of the interval, which can shrink it to zero in 32 steps. With a probabilistic approach, these sequences can't hurt you.
I think I would use something like four guesses for very small lists, and crank it up to roughly 16 as the size of the list approaches the limit. The high starting value is due to the fact that any such algorithm will be memory bound, i.e. your CPU has ample time to check a value while it waits for the next values to arrive from memory, so you had better make good use of that time to reduce the number of passes required.
A further optimization would instantly replace a busted guess with a new one and keep track of where the replacement happened, so that you can avoid a complete second pass through the data. Also, move the busted guess to the end of the list of guesses, so that you only need to check against the start position of the first guess in your loop to stop as early as possible.
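A rough sketch of that probabilistic probe (my illustration; the node layout, the rand32 helper and the fixed guess count are assumptions, and the busted-guess replacement described above is left out for brevity):
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct node { uint32_t id; struct node *next; };

static uint32_t rand32(void)
{
    /* assumption: stitch a 32-bit value together from two rand() calls */
    return ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

uint32_t choose_unused(const struct node *head)
{
    enum { NGUESS = 8 };
    uint32_t guess[NGUESS];
    for (;;)
    {
        for (int i = 0; i < NGUESS; i++)
            guess[i] = rand32();
        bool busted[NGUESS] = { false };
        int alive = NGUESS;
        /* a single pass checks all guesses simultaneously */
        for (const struct node *p = head; p && alive > 0; p = p->next)
            for (int i = 0; i < NGUESS; i++)
                if (!busted[i] && p->id == guess[i])
                {
                    busted[i] = true;
                    alive--;
                }
        for (int i = 0; i < NGUESS; i++)
            if (!busted[i])
                return guess[i];
        /* all guesses collided; with a sparse value space this is astronomically unlikely, so retry */
    }
}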
If you can spare one pointer in each object, you get an O(n) worst-case algorithm easily (standard divide-and-conquer):
Divide the range of possible IDs equally.
Make a singly-linked list covering each subrange.
If one subrange is empty, choose any id in it.
Otherwise repeat with the elements of the subrange with fewest elements.
Example code using two sub-ranges per iteration:
unsigned getunusedid(element* h) {
    unsigned start = 0, stop = 0xFFFFFFFF;
    /* copy the main links into scratch "next" links, keeping h intact
       (the original consumed h in this loop, so the search never ran) */
    for (element* p = h; p; p = p->mainnext)
        p->next = p->mainnext;
    while (h) {
        element *l = 0, *r = 0;
        unsigned cl = 0, cr = 0;
        unsigned mid = start + (stop - start) / 2;
        while (h) {
            element* next = h->next;
            if (h->id <= mid) {      /* left subrange is [start, mid] */
                h->next = l;
                cl++;
                l = h;
            } else {                 /* right subrange is [mid+1, stop] */
                h->next = r;
                cr++;
                r = h;
            }
            h = next;
        }
        /* descend into a subrange that has fewer ids than it has room for;
           one always exists by the pigeonhole principle */
        if (cl < mid - start + 1) {
            h = l;
            stop = mid;
        } else {
            h = r;
            start = mid + 1;
        }
    }
    return start;
}
Some more remarks:
Beware of bugs in the above code; I have only proved it correct, not tried it.
Using more buckets (best kept to a power of 2 for easy and efficient handling) each iteration might be faster due to better data locality (though only try and measure this if it's not fast enough otherwise), as #MarkDickson rightly remarks.
Without those extra-pointers, you need full sweeps each iteration, raising the bound to O(n*lg n).
An alternative would be using 2+ extra-pointers per element to maintain a balanced tree. That would speed up id-search, at the expense of some memory and insertion/removal time overhead.
If you don't mind an O(n) scan for each change in the list and two extra bits per element: whenever an element is inserted or removed, scan through and use the two bits to record whether the integers (element + 1) and (element - 1) exist in the list.
For example, inserting the element, 2, the extra bits for each 3 and 1 in the list would be updated to show that 3-1 (in the case of 3) and 1+1 (in the case of 1) now exist in the list.
Insertion/deletion time can be reduced by adding a pointer from each element to the next element with the same integer.
I am supposing that integers have random values not controlled by your code.
Add two unsigned integers in your list class:
unsigned int rangeMinId = 0;
unsigned int rangeMaxId = 0xFFFFFFFF ;
Or, if changing the List class is not possible, add them as global variables.
When the list is empty, you always know that the whole range is free. When you add a new item to the list, check whether its ID lies between rangeMinId and rangeMaxId, and if so, move the nearer of the two to this ID; a sketch follows.
It may happen, after a long time, that rangeMinId becomes equal to rangeMaxId - 1; then you need a simple function which traverses the whole list and searches for another free range. But this will not happen very frequently.
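In code, that bookkeeping could look roughly like this (my sketch under the stated assumptions; the midpoint pick in pick_free_id is arbitrary):
#include <stdint.h>

/* invariant: no id strictly between rangeMinId and rangeMaxId is in the list */
static uint32_t rangeMinId = 0;
static uint32_t rangeMaxId = 0xFFFFFFFF;

void on_insert(uint32_t id)
{
    if (id > rangeMinId && id < rangeMaxId)
    {
        /* move the nearer boundary, keeping the free interval as large as possible */
        if (id - rangeMinId < rangeMaxId - id)
            rangeMinId = id;
        else
            rangeMaxId = id;
    }
}

uint32_t pick_free_id(void)
{
    /* valid while rangeMaxId - rangeMinId >= 2; otherwise rescan the list for a new free range */
    return rangeMinId + (rangeMaxId - rangeMinId) / 2;
}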
Other solutions are more complex and involve using sets, binary trees or sorted arrays.
Update:
The free-range search function can be done in O(n*log(n)). An example of such a function is given below (I have not extensively tested it). The example is for an integer array but can easily be adapted for a list.
int g_Calls = 0;

bool _findFreeRange(const int* value, int n, int& left, int& right)
{
    g_Calls++;
    int l = left, r = right, l2, r2;
    int m = (right + left) / 2;
    int nl = 0, nr = 0;
    for (int k = 0; k < n; k++)
    {
        const int& i = value[k];
        if (i > l && i < r)
        {
            if (i - l < r - i)
                l = i;
            else
                r = i;
        }
        if (i < m)
            nl++;
        else
            nr++;
    }
    if ((r - l) > 1)
    {
        left = l;
        right = r;
        return true;
    }
    if (nl < nr)
    {
        // check first left then right
        l2 = left;
        r2 = m;
        if (r2 - l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2;
            right = r2;
            return true;
        }
        l2 = m;
        r2 = right;
        if (r2 - l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2;
            right = r2;
            return true;
        }
    }
    else
    {
        // check first right then left
        l2 = m;
        r2 = right;
        if (r2 - l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2;
            right = r2;
            return true;
        }
        l2 = left;
        r2 = m;
        if (r2 - l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2;
            right = r2;
            return true;
        }
    }
    return false;
}
bool findFreeRange(const int* value, int n, int& left, int& right, int maxx)
{
    g_Calls = 1;
    left = 0;
    right = maxx;
    if (!_findFreeRange(value, n, left, right))
        return false;
    left++;
    right--;
    return (right - left) >= 0;
}
If it returns false, the list is full and there is no free range (very unlikely); maxx is the upper limit of the range, in this case 0xFFFFFFFF.
The idea is first to search for the biggest gap in the list, and then, if no free hole is found, to recursively search the subranges for holes which may have been missed during the first pass. If the list is sparsely filled, it is very unlikely that the function will be called more than once. However, when the list becomes almost completely full, the range search can take longer. Thus in this worst-case scenario, when the list is close to full, it is better to start keeping all free ranges in a list.
This reminds me of the book Programming Pearls, and in particular the very first column, "Cracking the Oyster". What is the real problem you are trying to solve?
If your list is small, then a simple linear search to find max/min would work and it would work quickly.
When your list gets large and linear search becomes unwieldy, you can build a bitmap to represent the unused numbers for much less memory than adding 2 extra pointers at each node of the linked list: 2^32 bits = 2^29 bytes = 512 MiB of RAM, compared to your linked list being potentially >10 GB.
Then to find an unused number, you can traverse the bitmap one machine word at a time, checking whether the word is non-zero. If it is, then at least one number in that 32- or 64-bit block is unused, and you can inspect the word to find out exactly which bit is set. As you add numbers to the list, all you have to do is clear the corresponding bit in the bitmap; a sketch follows.
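A sketch of that probe (my illustration; the convention is that a set bit means "unused", so the bitmap starts out as all ones):
#include <stdint.h>

#define NWORDS (1u << 26)    /* 2^32 bits / 64 bits per word */

uint32_t find_unused(const uint64_t *bitmap)
{
    for (uint32_t w = 0; w < NWORDS; w++)
        if (bitmap[w] != 0)
        {
            uint64_t word = bitmap[w];
            uint32_t bit = 0;
            while (!(word & 1))      /* locate the lowest set bit */
            {
                word >>= 1;
                bit++;
            }
            return w * 64 + bit;
        }
    return 0;    /* only reached if all 2^32 values are in use */
}

/* when a value v is added to the list:
   bitmap[v / 64] &= ~((uint64_t)1 << (v % 64)); */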
One possible solution is to take the min and max of the list with a simple O(n) iteration, then pick a number between max and min + (1 << 32). This is simple to do since overflow/underflow behavior is well-defined for unsigned integers:
uint32_t min, max;
// TODO: compute min and max here
// exclude max from choice space (min will be an exclusive upper bound)
max++;
uint32_t choice = rand32() % (min - max) + max; // where rand32 is a random unsigned 32-bit integer
Of course, if it doesn't need to be random, then you can just use one more than the maximum of the list.
Note: the only case where this fails is if min is 0 and max is UINT32_MAX (aka 4294967295).
OK, here is one really simple solution. Some of the answers have become too theoretical and complicated to optimize. If you need a quick solution, do this:
In your list class add a member:
unsigned int NextFreeId = 1;
Also add an std::set<unsigned int> ids.
When you add an item to the list, also insert its integer into the set and keep track of NextFreeId:
int insert(unsigned int id)
{
    ids.insert(id);
    if (NextFreeId == id) // will not happen too frequently
    {
        unsigned int nextid = id + 1, previd = id - 1;
        while (true)
        {
            if (nextid < 0xFFFFFFFF && !ids.count(nextid))
            {
                NextFreeId = nextid;
                break;
            }
            if (previd > 0 && !ids.count(previd))
            {
                NextFreeId = previd;
                break;
            }
            if (previd == 0 && nextid == 0xFFFFFFFF)
                break; // the whole range is filled, there is no free id
            if (nextid < 0xFFFFFFFF) nextid++; // guard against wrapping past the ends
            if (previd > 0) previd--;
        }
    }
    return 1;
}
Sets are very efficient for checking whether a value is contained, so the complexity will be O(log(N)). It is quick to implement. Also, the set is searched not on every insert but only when NextFreeId is taken. The list is not traversed at all.

Is it possible to have only one comparison per iteration of a binary search algorithm?

In binary search algorithm we have two comparisons:
if (key == a[mid]) then found;
else if (key < a[mid]) then binary_search(a[],left,mid-1);
else binary_search(a[],mid+1,right);
Is there a way by which I can have only one comparison instead of the above two?
--
Thanks
Alok.Kr.
See:
http://en.wikipedia.org/wiki/Binary_search_algorithm#Single_comparison_per_iteration
Taken from wiki:
low = 0
high = N
while (low < high) {
    mid = low + ((high - low) / 2)
    if (A[mid] < value)
        low = mid + 1;
    else
        // can't be high = mid - 1: here A[mid] >= value,
        // so high can't be < mid if A[mid] == value
        high = mid;
}
// high == low, using high or low depends on taste
if ((low < N) && (A[low] == value))
    return low  // found
else
    return -1   // not found
Pros/cons from wiki:
"This approach foregoes the possibility of early termination on discovery of a match, thus successful searches have log2(N) iterations instead of an expected log2(N) − 1 iterations. On the other hand, this implementation makes fewer comparisons: log2(N) is less than the expected number of comparisons for the two-test implementations of 1·5(log2(N) − 1), for N greater than eight."
Yes. Just don't eliminate mid from the recursive call.
if ( left == right ) return NULL;
if ( left + 1 == right ) return key == a[left] ? &a[left] : NULL;
mid = left + ( right - left ) / 2;  // note: the parentheses matter
if (key < a[mid]) return binary_search(a[], left, mid);  // a[mid] is too big, exclude it
else return binary_search(a[], mid, right);              // include `mid` in next round
You only need to eliminate half of the set with each recursion to achieve O(logN) performance. You're going above and beyond by eliminating half+1.
If you only use < during recursion, the algorithm will find the least element which is not less than key (but may be greater than key). Finish off by performing a single equality test.
In assembler, you could:
cmp key,a[mid]
beq found
bge else
So if your compiler is really good at peephole optimizations, it might already do this for you.
This is a recursive algorithm. The first comparison is the stopping criterion and the second is the actual search, so you cannot remove them.
In the first you ask whether you have already found the element, and in the second, which part of the array to keep looking in. You cannot make both decisions based on only one comparison.
First things first: do you need to optimize the program? Have you measured to know where you need to do it? Is it in this function?
For primitive types the second comparison is about as fast as operations get. The expensive part of a comparison is loading the element into the appropriate register, and that is already needed for the first comparison. Once that comparison has executed, the value is in a register and the second comparison takes a single processor instruction, plus the possible cost of a branch misprediction.
Assuming integral types, the processor-time cost of the algorithm is most probably dominated by the cost of the recursive calls if the compiler is unable to perform tail-call optimization. If you really need to optimize this, try compiling with all optimization flags on and inspect the assembly to check whether the tail call is being optimized away. If not, manually convert the algorithm from recursive to iterative.
This will have two effects: it obscures the code (avoid modifying a clean solution unless you really need to) and it avoids function calls.
If you are speaking of C++, and the type is complex with expensive overloaded comparison operators, the fastest boost in performance is implementing a compare method that returns a negative number for less-than, 0 for equal, and a positive number for greater-than. Then compute that result once per iteration and perform integer-only checks afterwards. That reduces the overall cost of the algorithm to a single evaluation of the expensive comparison per iteration, which puts you back in the original situation. A sketch follows.
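As a sketch (mine, in C; cmp is a hypothetical user-supplied three-way comparison in the style of the one bsearch takes):
#include <stddef.h>

/* returns the index of a match, or -1 */
long search_with_cmp(const void *key, const void *base, size_t n, size_t size,
                     int (*cmp)(const void *, const void *))
{
    size_t lo = 0, hi = n;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        /* the expensive comparison runs exactly once; the rest are integer tests */
        int c = cmp(key, (const char *)base + mid * size);
        if (c == 0)
            return (long)mid;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}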
for (step = 1; step < n; step <<= 1);   /* smallest power of two >= n (step must start at 1, not 0) */
for (i = 0; step; step >>= 1)
    if (i + step < n && v[i + step] <= x)
        i += step;
/* afterwards i is the last index with v[i] <= x (given v[0] <= x): one comparison per iteration */
OK, this was an interview question at Adobe, and I was just trying to figure out how to do it.
Now I have a solution, so I'm posting it:
void binary_search (int a[], int low, int high, int key)
{
    if (low > high)
        return; // not found (this base case was missing)
    int mid = (low + high) / 2;
    if (key == a[mid]) {
        printf ("Number Found\n");
        return;
    }
    else {
        // sign is 1 when key < a[mid], 0 otherwise (assumes key - a[mid] does not overflow)
        int sign = Calc_sign (key - a[mid]);
        low  = low * sign + (1 - sign) * (mid + 1);
        high = (mid - 1) * sign + (1 - sign) * high;
        binary_search (a, low, high, key);
    }
}

int Calc_sign (int a)
{
    return ((a & 0x80000000) >> 31);
}
So in the code there is only one comparison checking whether the key value is equal to the mid element.
--
Thanks
Alok Kr.

How can I find a number which occurs an odd number of times in a SORTED array in O(n) time?

I have a question and I tried to think over it again and again... but got nothing so posting the question here. Maybe I could get some view-point of others, to try and make it work...
The question is: we are given a SORTED array consisting of a collection of values each occurring an EVEN number of times, except one, which occurs an ODD number of times. We need to find that value in log n time.
It is easy to find the solution in O(n) time, but it looks pretty tricky to do in O(log n) time.
Theorem: Every deterministic algorithm for this problem probes Ω((log n)²) memory locations in the worst case.
Proof (completely rewritten in a more formal style):
Let k > 0 be an odd integer and let n = k². We describe an adversary that forces (log2(k + 1))² = Ω((log n)²) probes.
We call the maximal subsequences of identical elements groups. The adversary's possible inputs consist of k length-k segments x1 x2 … xk. For each segment xj, there exists an integer bj ∈ [0, k] such that xj consists of bj copies of j - 1 followed by k - bj copies of j. Each group overlaps at most two segments, and each segment overlaps at most two groups.
Group boundaries
| | | | |
0 0 1 1 1 2 2 3 3
| | | |
Segment boundaries
Wherever there is an increase of two, we assume a double boundary by convention.
Group boundaries
| || | |
0 0 0 2 2 2 2 3 3
Claim: The location of the jth group boundary (1 ≤ j ≤ k) is uniquely determined by the segment xj.
Proof: It's just after the ((j - 1) k + bj)th memory location, and xj uniquely determines bj. //
We say that the algorithm has observed the jth group boundary in case the results of its probes of xj uniquely determine xj. By convention, the beginning and the end of the input are always observed. It is possible for the algorithm to uniquely determine the location of a group boundary without observing it.
Group boundaries
| X | | |
0 0 ? 1 2 2 3 3 3
| | | |
Segment boundaries
Given only 0 0 ?, the algorithm cannot tell for sure whether ? is a 0 or a 1. In context, however, ? must be a 1, as otherwise there would be three odd groups, and the group boundary at X can be inferred. These inferences could be problematic for the adversary, but it turns out that they can be made only after the group boundary in question is "irrelevant".
Claim: At any given point during the algorithm's execution, consider the set of group boundaries that it has observed. Exactly one consecutive pair is at odd distance, and the odd group lies between them.
Proof: Every other consecutive pair bounds only even groups. //
Define the odd-length subsequence bounded by the special consecutive pair to be the relevant subsequence.
Claim: No group boundary in the interior of the relevant subsequence is uniquely determined. If there is at least one such boundary, then the identity of the odd group is not uniquely determined.
Proof: Without loss of generality, assume that each memory location not in the relevant subsequence has been probed and that each segment contained in the relevant subsequence has exactly one location that has not been probed. Suppose that the jth group boundary (call it B) lies in the interior of the relevant subsequence. By hypothesis, the probes to xj determine B's location up to two consecutive possibilities. We call the one at odd distance from the left observed boundary odd-left and the other odd-right. For both possibilities, we work left to right and fix the location of every remaining interior group boundary so that the group to its left is even. (We can do this because they each have two consecutive possibilities as well.) If B is at odd-left, then the group to its left is the unique odd group. If B is at odd-right, then the last group in the relevant subsequence is the unique odd group. Both are valid inputs, so the algorithm has uniquely determined neither the location of B nor the odd group. //
Example:
Observed group boundaries; relevant subsequence marked by […]
[ ] |
0 0 Y 1 1 Z 2 3 3
| | | |
Segment boundaries
Possibility #1: Y=0, Z=2
Possibility #2: Y=1, Z=2
Possibility #3: Y=1, Z=1
As a consequence of this claim, the algorithm, regardless of how it works, must narrow the relevant subsequence to one group. By definition, it therefore must observe some group boundaries. The adversary now has the simple task of keeping open as many possibilities as it can.
At any given point during the algorithm's execution, the adversary is internally committed to one possibility for each memory location outside of the relevant subsequence. At the beginning, the relevant subsequence is the entire input, so there are no initial commitments. Whenever the algorithm probes an uncommitted location of xj, the adversary must commit to one of two values: j - 1, or j. If it can avoid letting the jth boundary be observed, it chooses a value that leaves at least half of the remaining possibilities (with respect to observation). Otherwise, it chooses so as to keep at least half of the groups in the relevant interval and commits values for the others.
In this way, the adversary forces the algorithm to observe at least log2 (k + 1) group boundaries, and in observing the jth group boundary, the algorithm is forced to make at least log2 (k + 1) probes.
Extensions:
This result extends straightforwardly to randomized algorithms by randomizing the input, replacing "at best halved" (from the algorithm's point of view) with "at best halved in expectation", and applying standard concentration inequalities.
It also extends to the case where no group can be larger than s copies; in this case the lower bound is Ω(log n log s).
A sorted array suggests a binary search. We have to redefine equality and comparison. Equality simply means a group with an odd number of elements. We can do comparison by observing the index of the first or last element of the group: the first element will be at an even index (0-based) before the odd group, and at an odd index after the odd group. We can find the first and last elements of a group using binary search. The total cost is O((log N)²).
PROOF OF O((log N)²)
T(2) = 1 //to make the summation nice
T(N) = log(N) + T(N/2) //log(N) is finding the first/last elements
For some N=2^k,
T(2^k) = (log 2^k) + T(2^(k-1))
= (log 2^k) + (log 2^(k-1)) + T(2^(k-2))
= (log 2^k) + (log 2^(k-1)) + (log 2^(k-2)) + ... + (log 2^2) + 1
= k + (k-1) + (k-2) + ... + 1
= k(k+1)/2
= (k² + k)/2
= (log(N)² + log(N))/ 2
= O(log(N)²)
Look at the middle element of the array. With a couple of appropriate binary searches, you can find its first and last appearance in the array. E.g., if the middle element is 'a', you need to find i and j as shown below:
[* * * * a a a a * * *]
^ ^
| |
| |
i j
Is j - i an even number? You are done! Otherwise (and this is the key here), ask: is i an even or an odd number? Do you see what this piece of knowledge implies? Then the rest is easy; a sketch follows.
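A sketch of that recursion (my illustration in C; it assumes arr[lo..hi] is sorted and contains exactly one value with an odd count):
int first_of(const int *a, int lo, int hi, int v)
{
    while (lo < hi)                      /* first index in [lo, hi] with a[i] == v */
    {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < v) lo = mid + 1; else hi = mid;
    }
    return lo;
}

int last_of(const int *a, int lo, int hi, int v)
{
    while (lo < hi)                      /* last index in [lo, hi] with a[i] == v */
    {
        int mid = lo + (hi - lo + 1) / 2;
        if (a[mid] > v) hi = mid - 1; else lo = mid;
    }
    return lo;
}

int odd_one(const int *a, int lo, int hi)
{
    for (;;)
    {
        int e = a[lo + (hi - lo) / 2];
        int i = first_of(a, lo, hi, e);
        int j = last_of(a, lo, hi, e);
        if ((j - i) % 2 == 0)            /* j - i even: the group's size is odd */
            return e;
        if ((i - lo) % 2 == 1)           /* odd number of elements left of the group */
            hi = i - 1;                  /* so the odd value lies to the left */
        else
            lo = j + 1;                  /* otherwise it lies to the right */
    }
}
Since the middle element's group always covers the midpoint, each round discards at least half of the subarray, giving the O((log N)²) bound mentioned above.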
This answer is in support of the answer posted by "throwawayacct". He deserves the bounty. I spent some time on this question and I'm totally convinced that his proof is correct that you need Ω(log(n)^2) queries to find the number that occurs an odd number of times. I'm convinced because I ended up recreating the exact same argument after only skimming his solution.
In the solution, an adversary creates an input to make life hard for the algorithm, but also simple for a human analyzer. The input consists of k pages that each have k entries. The total number of entries is n = k^2, and it is important that O(log(k)) = O(log(n)) and Ω(log(k)) = Ω(log(n)). To make the input, the adversary makes a string of length k of the form 00...011...1, with the transition in an arbitrary position. Then each symbol in the string is expanded into a page of length k of the form aa...abb...b, where on the ith page, a=i and b=i+1. The transition on each page is also in an arbitrary position, except that the parity agrees with the symbol that the page was expanded from.
It is important to understand the "adversary method" of analyzing an algorithm's worst case. The adversary answers queries about the algorithm's input, without committing to future answers. The answers have to be consistent, and the game is over when the adversary has been pinned down enough for the algorithm to reach a conclusion.
With that background, here are some observations:
1) If you want to learn the parity of a transition in a page by making queries in that page, you have to learn the exact position of the transition and you need Ω(log(k)) queries. Any collection of queries restricts the transition point to an interval, and any interval of length more than 1 has both parities. The most efficient search for the transition in that page is a binary search.
2) The most subtle and most important point: There are two ways to determine the parity of a transition inside a specific page. You can either make enough queries in that page to find the transition, or you can infer the parity if you find the same parity in both an earlier and a later page. There is no escape from this either-or. Any set of queries restricts the transition point in each page to some interval. The only restriction on parities comes from intervals of length 1. Otherwise the transition points are free to wiggle to have any consistent parities.
3) In the adversary method, there are no lucky strikes. For instance, suppose that your first query in some page is toward one end instead of in the middle. Since the adversary hasn't committed to an answer, he's free to put the transition on the long side.
4) The end result is that you are forced to directly probe the parities in Ω(log(k)) pages, and the work for each of these subproblems is also Ω(log(k)).
5) Things are not much better with random choices than with adversarial choices. The math is more complicated, because now you can get partial statistical information, rather than a strict yes you know a parity or no you don't know it. But it makes little difference. For instance, you can give each page length k^2, so that with high probability, the first log(k) queries in each page tell you almost nothing about the parity in that page. The adversary can make random choices at the beginning and it still works.
Start at the middle of the array and walk backward until you get to a value that's different from the one at the center. Check whether the number above that boundary is at an odd or even index. If it's odd, then the number occurring an odd number of times is to the left, so repeat your search between the beginning and the boundary you found. If it's even, then the number occurring an odd number of times must be later in the array, so repeat the search in the right half.
As stated, this has both a logarithmic and a linear component. If you want to keep the whole thing logarithmic, instead of just walking backward through the array to a different value, you want to use a binary search instead. Unless you expect many repetitions of the same numbers, the binary search may not be worthwhile though.
I have an algorithm which works in log(N/C)·log(K) time, where K is the length of the longest same-value range and C is the length of the range being searched for.
The main difference between this algorithm and most posted before is that it takes advantage of the case where all same-value ranges are short. It finds boundaries not by binary-searching the entire array, but by first finding a rough estimate by jumping back by 1, 2, 4, 8, ... steps (log(K) iterations) and then binary-searching the resulting range (log(K) again).
The algorithm is as follows (written in C#):
// Finds the start of the range of equal numbers containing the index "index",
// which is assumed to be inside the array
//
// Complexity is O(log(K)) with K being the length of the range
static int findRangeStart (int[] arr, int index)
{
    int candidate = index;
    int value = arr[index];
    int step = 1;
    // find the boundary for binary search:
    while (candidate >= 0 && arr[candidate] == value)
    {
        candidate -= step;
        step *= 2;
    }
    // binary search:
    int a = Math.Max(0, candidate);
    if (arr[a] == value)
        return a;   // the range reaches the start of the array
    int b = candidate + step / 2;
    while (a + 1 != b)
    {
        int c = (a + b) / 2;
        if (arr[c] == value)
            b = c;
        else
            a = c;
    }
    return b;
}
// Finds the index after the only "odd" range of equal numbers in the array.
// The result should be in the range (start; end]
// The "end" is considered to always be the end of some equal number range.
static int search(int[] arr, int start, int end)
{
    if (arr[start] == arr[end - 1])
        return end;
    int middle = (start + end) / 2;
    int rangeStart = findRangeStart(arr, middle);
    if ((rangeStart & 1) == 0)
        return search(arr, middle, end);
    return search(arr, start, rangeStart);
}

// Finds the index after the only "odd" range of equal numbers in the array
static int search(int[] arr)
{
    return search(arr, 0, arr.Length);
}
Take the middle element e. Use binary search to find the first and last occurrence. O(log(n))
If it is odd return e.
Otherwise, recurse onto the side that has an odd number of elements [....]eeee[....]
Runtime will be log(n) + log(n/2) + log(n/4).... = O(log(n)^2).
Ahh, there is an answer.
Do a binary search, and as you search, for each value, move backwards until you find the first entry with that same value. If its index is even, it is before the oddball, so move to the right.
If its array index is odd, it is after the oddball, so move to the left.
In pseudocode (this is the general idea, not tested...):
private static int FindOddBall(int[] ary)
{
    int l = 0,
        r = ary.Length - 1;
    int n = (l + r) / 2;
    while (r > l + 2)
    {
        n = (l + r) / 2;
        while (ary[n] == ary[n - 1])
            n = FindBreakIndex(ary, l, n);
        if (n % 2 == 0) // even index: we are on or to the left of the oddball
            l = n;
        else            // odd index: we are to the right of the oddball
            r = n - 1;
    }
    return ary[l];
}

private static int FindBreakIndex(int[] ary, int l, int n)
{
    var t = ary[n];
    var r = n;
    while (ary[n] != t || ary[n] == ary[n - 1])
        if (ary[n] == t)
        {
            r = n;
            n = (l + r) / 2;
        }
        else
        {
            l = n;
            n = (l + r) / 2;
        }
    return n;
}
You can use this algorithm:
int GetSpecialOne(int[] array, int length)
{
    int specialOne = array[0];
    for (int i = 1; i < length; i++)
    {
        specialOne ^= array[i];
    }
    return specialOne;
}
Solved with the help of a similar question on http://www.technicalinterviewquestions.net
We don't have any information about the distribution of the lengths of the runs inside the array, or of the array as a whole, right?
So the array length might be 1, 11, 101, 1001 or something else, at least 1 with no upper bound, and it must contain at least 1 distinct value ('number'), up to (length-1)/2 + 1 distinct values; for total sizes of 1, 11, 101 that is 1, 1 to 6, and 1 to 51 distinct values, and so on.
Shall we assume every possible size to be of equal probability? This would lead to a mean sublist length of size/4, wouldn't it?
An array of size 5 could be divided into 1, 2 or 3 sublists.
What seems to be obvious is not that obvious, if we go into details.
An array of size 5 can be 'divided' into one sublist in just one way, with arguable right to call it 'dividing'. It's just a list of 5 elements (aaaaa). To avoid confusion, let's assume the elements inside the list to be ordered characters, not numbers (a, b, c, ...).
Divided into two sublists, they might be (1, 4), (2, 3), (3, 2), (4, 1): (abbbb, aabbb, aaabb, aaaab).
Now let's look back at the claim made before: shall the 'division' (5) be assumed to have the same probability as those 4 divisions into 2 sublists? Or shall we mix them together, and assume every partition equally probable, (1/5)?
Or can we calculate the solution without knowing the probability of the lengths of the sublists?
The clue is that you're looking for log(n). That's less than n.
Stepping through the entire array, one element at a time? That's n. That's not going to work.
We know that the first two indexes in the array (0 and 1) should hold the same number, as should 50 and 51, if the odd one out comes after them.
So find the middle element of the array and compare it to the element right after it. If the change in numbers happens at the wrong index parity, we know the odd one out is before it; otherwise, it's after. With one set of comparisons, we figure out which half of the array the target is in.
Keep going from there.
Use a hash table.
For each element E in the input set
    if E is in the hash table
        increment its count
    else
        insert E with count 1
For each key K in the hash table
    if the count of K is odd
        return K
As this algorithm makes two O(n) passes, it is O(n).
Try this:
int getOddOccurrence(int ar[], int ar_size)
{
    int i;
    int result = 0;
    for (i = 0; i < ar_size; i++)
        result = result ^ ar[i];
    return result;   // the original returned an undeclared `res`
}
XOR cancels out every time you XOR with the same number, so 1^1 = 0 but 1^1^1 = 1; every pair cancels out, leaving the odd number out.
Assume indexing starts at 0. Binary search for the smallest even i such that x[i] != x[i+1]; your answer is x[i].
edit: due to public demand, here is the code
/* binary search over pair indices: the smallest pair index i whose pair
   (x[2i], x[2i+1]) is mismatched marks the element with the odd count
   (the original version had off-by-one errors in its index handling) */
int f(int *x, int n) {
    int min = 0, max = (n - 1) / 2;   /* n is odd, so there are (n-1)/2 pairs */
    while (min < max) {
        int i = (min + max) / 2;
        if (x[2*i] == x[2*i + 1])
            min = i + 1;              /* pairs intact so far; look right */
        else
            max = i;                  /* pairing already broken at or before i */
    }
    return x[2*min];
}

Resources