We have a strictly increasing array of length n (1 < n < 500). We sum the digits of each element to create a new array, each of whose elements is in the range 1 to 500. The task is to rebuild the old array from the new one. Since there might be more than one answer, we want the answer with the minimum value of the last element.
Example:
3 11 23 37 45 123 => 3 2 5 10 9 6
Now, from the second array, we can rebuild the original array in many different ways, for instance:
12 20 23 37 54 60
From all the possible combinations, we need the one with the minimum last element.
My Thoughts so far:
The brute force way is to find all possible permutations to create each number and then create all combinations possible of all numbers of the second array and find the combination with minimum last element. It is obvious that this is not a good choice.
Using this algorithm (with exponential time!) we can create all possible permutations of digits that sum to a number in the second array. Note that we know the original elements were less than 500, so we can limit the depth of search of the algorithm.
One way I thought of that might find the answer faster is to:
Start from the last element of the new array and find all possible numbers whose digit sum equals this element.
Use the smallest such number for this element.
Do the same with the second-to-last element. If the minimum value found for the second-to-last element is not smaller than the one chosen for the last element, backtrack to the last element and try a larger value.
Repeat until you get to the first element.
I think this is a greedy solution, but I'm not very sure about the time complexity. Also, I want to know whether there is a better solution for this problem, such as using dp.
For simplicity, let's have our sequence 1-based and the input sequence is called x.
We will also use a utility function, which returns the sum of the digits of a given number:
int sum(int x) {
    int result = 0;
    while (x > 0) {
        result += x % 10;
        x /= 10;
    }
    return result;
}
Let's assume that we are at index idx and try to set there some number called value (given that the sum of digits of value is x[idx]). If we do so, then what could we say about the previous number in the sequence? It should be strictly less than value.
So we already have a state for a potential dp approach - [idx, value], where idx is the index where we are currently at and value denotes the value we are trying to set on this index.
If the dp table holds boolean values, we will know we have found an answer if we have found a suitable number for the first number in the sequence. Therefore, if there is a path starting from the last row in the dp table and ends at row 0 then we'll know we have found an answer and we could then simply restore it.
Our recurrence function will be something like this:
f(idx, value) = OR {dp[idx - 1][value'], where sumOfDigits(value) = x[idx] and value' < value}
f(0, *) = true
Also, in order to restore the answer, we need to track the path. Once we set any dp[idx][value] cell to true, we can save the value' to which we would like to jump in the previous table row.
Now let's code that one. I hope the code is self-explanatory:
// dp[idx][value]: the first idx elements can be rebuilt with element idx equal to value
boolean[][] dp = new boolean[n + 1][501];
// prev[idx][value]: smallest value usable at idx - 1 when element idx is value
int[][] prev = new int[n + 1][501];
for (int i = 0; i <= 500; i++) {
    dp[0][i] = true;
}
for (int idx = 1; idx <= n; idx++) {
    for (int value = 1; value <= 500; value++) {
        if (sum(value) == x[idx]) {
            for (int smaller = 0; smaller < value; smaller++) {
                dp[idx][value] |= dp[idx - 1][smaller];
                if (dp[idx][value]) {
                    prev[idx][value] = smaller;
                    break;
                }
            }
        }
    }
}
The prev table only keeps information about which is the smallest value', which we can use as previous to our idx in the resulting sequence.
Now, in order to restore the sequence, we can start from the last element. We would like it to be minimal, so we can find the first one that has dp[n][value] = true. Once we have such element, we then use the prev table to track down the values up to the first one:
int[] result = new int[n];
int idx = n - 1;
for (int i = 0; i <= 500; i++) {
    if (dp[n][i]) {
        int row = n, col = i;
        while (row > 0) {
            result[idx--] = col;
            col = prev[row][col];
            row--;
        }
        break;
    }
}
for (int i = 0; i < n; i++) {
    out.print(result[i]);
    out.print(' ');
}
If we apply this on an input sequence:
3 2 5 10 9 6
we get
3 11 14 19 27 33
The time complexity is O(n * m * m), where n is the number of elements we have and m is the maximum possible value that an element could hold.
The space complexity is O(n * m) as this is dominated by the size of the dp and prev tables.
We can use a greedy algorithm: proceed through the array in order, setting each element to the least value that is greater than the previous element and has digits with the appropriate sum. (We can just iterate over the possible values and check the sums of their digits.) There's no need to consider any greater value than that, because increasing a given element will never make it possible to decrease a later element. So we don't need dynamic programming here.
We can calculate the sum of the digits of an integer m in O(log m) time, so the whole solution takes O(b log b) time, where b is the upper bound (500 in your example).
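A minimal Python sketch of the greedy idea described above, for illustration only. The names `digit_sum` and `rebuild`, and the exclusive bound `upper=500`, are assumptions made for this example and are not part of the answer.

```python
def digit_sum(m):
    """Sum of the decimal digits of a non-negative integer."""
    total = 0
    while m > 0:
        total += m % 10
        m //= 10
    return total


def rebuild(sums, upper=500):
    """Greedy rebuild: for each target digit sum, take the least value that is
    greater than the previous element and has that digit sum.

    `upper` is an assumed exclusive bound on the original elements.
    Returns None when no strictly increasing sequence fits below `upper`.
    """
    result = []
    candidate = 1
    for target in sums:
        while candidate < upper and digit_sum(candidate) != target:
            candidate += 1
        if candidate >= upper:
            return None
        result.append(candidate)
        candidate += 1  # the next element must be strictly greater
    return result


print(rebuild([3, 2, 5, 10, 9, 6]))  # [3, 11, 14, 19, 27, 33]
```

Because `candidate` only ever moves forward, every value below the bound is examined at most once, which is where the O(b log b) estimate comes from.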
For example, if the array is arr[] = {4, 2, 6, 1, 5} and k = 3, then the output should be 4 2 1.
It can be done in O(nk) steps and O(1) space.
Firstly, find the kth smallest number in kn steps: find the minimum; store it in a local variable min; then find the second smallest number, i.e. the smallest number that is greater than min; store it in min; and so on... repeat the process from i = 1 to k (each time it's a linear search through the array).
Having this value, browse through the array and print all elements that are smaller or equal to min. This final step is linear.
Care has to be taken if there are duplicate values in the array. In such a case we have to increment i several times if duplicate min values are found in one pass. Additionally, besides min variable we have to have a count variable, which is reset to zero with each iteration of the main loop, and is incremented each time a duplicate min number is found.
In the final scan through the array, we print all values smaller than min, and up to count values exactly min.
The algorithm in C would look like this:
/* fragment: needs <stdio.h> for printf; arr, n and k are given */
int min = MIN_VALUE, local_min;
int count;
int i, j;
i = 0;
while (i < k) {
    local_min = MAX_VALUE;
    count = 0;
    /* find the smallest value greater than min, and how often it occurs */
    for (j = 0; j < n; j++) {
        if ((arr[j] > min || min == MIN_VALUE) && arr[j] < local_min) {
            local_min = arr[j];
            count = 1;
        }
        else if ((arr[j] > min || min == MIN_VALUE) && arr[j] == local_min) {
            count++;
        }
    }
    min = local_min;
    i += count;
}
/* if the last pass overshot k, keep only as many copies of min as needed */
if (i > k) {
    count = count - (i - k);
}
for (i = 0, j = 0; i < n; i++) {
    if (arr[i] < min) {
        printf("%d ", arr[i]);
    }
    else if (arr[i] == min && j < count) {
        printf("%d ", arr[i]);
        j++;
    }
}
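where MIN_VALUE and MAX_VALUE are sentinels standing in for -infinity and +infinity (for int data, INT_MIN and INT_MAX from <limits.h>). MIN_VALUE must not occur in arr: if min ever becomes equal to MIN_VALUE again, the test min == MIN_VALUE lets every element through and the search restarts from the smallest value, so initializing MIN_VALUE to arr[0] is not safe in general. MAX_VALUE can alternatively be set to the maximal value in arr (the max can be found in an additional initial loop).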
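For cross-checking the approach on small inputs, here is a hedged Python sketch of the same repeated-scan idea. It uses `None` instead of the MIN_VALUE/MAX_VALUE sentinels, and the function name `k_smallest_in_order` is made up for this example. Apart from the returned list, it keeps only a handful of scalar variables.

```python
def k_smallest_in_order(arr, k):
    """Return the k smallest items of arr in their original order.

    Uses up to k linear scans and a few scalars, as in the description above.
    """
    if k <= 0:
        return []
    if k >= len(arr):
        return list(arr)
    threshold = None  # plays the role of "min": largest value selected so far
    selected = 0      # number of elements <= threshold
    count = 0         # occurrences of the current threshold
    while selected < k:
        local_min, count = None, 0
        for v in arr:
            if threshold is not None and v <= threshold:
                continue  # already accounted for in an earlier scan
            if local_min is None or v < local_min:
                local_min, count = v, 1
            elif v == local_min:
                count += 1
        threshold = local_min
        selected += count
    count -= selected - k  # copies of threshold that still fit into k
    out, used = [], 0
    for v in arr:
        if v < threshold:
            out.append(v)
        elif v == threshold and used < count:
            out.append(v)
            used += 1
    return out


print(k_smallest_in_order([4, 2, 6, 1, 5], 3))  # [4, 2, 1]
```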
Single pass solution - O(k) space (for O(1) space see below).
The order of the items is preserved (i.e. stable).
// Pseudo code
if ( arr.size <= k )
    handle special case
array results[k]
int i = 0;
// init
for ( ; i < k; i++) { // or use memcpy()
    results[i] = arr[i]
}
int max_val = max of results
for( ; i < arr.size; i++) {
    if( arr[i] < max_val ) {
        remove largest in results // move the remaining up / memmove()
        add arr[i] at end of results // i.e. results[k-1] = arr[i]
        max_val = new max of results
    }
}
// for larger k you'd want some optimization to get the new max
// and maybe keep track of the position of max_val in the results array
Example:
4 6 2 3 1 5
4 6 2 // init
4 2 3 // remove 6, add 3 at end
2 3 1 // remove 4, add 1 at end
// or the original:
4 2 6 1 5
4 2 6 // init
4 2 1 // remove 6, add 1 -- if max is last, just replace
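The pseudo code above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the definitive version: `k_smallest_stable` is a made-up name, and `list.remove` drops the first occurrence of the current maximum, which is one possible tie-breaking choice.

```python
def k_smallest_stable(arr, k):
    """Single pass over arr with an O(k) buffer; kept items stay in input order."""
    if k <= 0:
        return []
    if len(arr) <= k:
        return list(arr)  # special case: everything is kept
    results = [arr[i] for i in range(k)]  # init with the first k items
    max_val = max(results)
    for i in range(k, len(arr)):
        if arr[i] < max_val:
            results.remove(max_val)   # remove largest, shifting the rest up
            results.append(arr[i])    # add the new item at the end
            max_val = max(results)    # O(k) rescan for the new maximum
    return results


print(k_smallest_stable([4, 6, 2, 3, 1, 5], 3))  # [2, 3, 1]
print(k_smallest_stable([4, 2, 6, 1, 5], 3))     # [4, 2, 1]
```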
Optimization:
If a few extra bytes are allowed, you can optimize for larger k:
create an array size k of objects {value, position_in_list}
keep the items sorted on value:
new value: drop last element, insert the new at the right location
new max is the last element
sort the end result on position_in_list
for really large k use binary search to locate the insertion point
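A possible Python sketch of this optimization, assuming (value, position) tuples kept sorted with the standard `bisect` module. The name `k_smallest_sorted_buffer` is hypothetical. Note that `bisect.insort` finds the insertion point by binary search, but inserting into a Python list still shifts up to k items.

```python
import bisect


def k_smallest_sorted_buffer(arr, k):
    """Keep the k smallest items as sorted (value, position) pairs, then
    restore the original order by sorting on position."""
    if k <= 0:
        return []
    buf = sorted((v, i) for i, v in enumerate(arr[:k]))
    for i in range(k, len(arr)):
        v = arr[i]
        if v < buf[-1][0]:              # the current max is the last element
            buf.pop()                   # drop the last (largest) element
            bisect.insort(buf, (v, i))  # insert the new one at the right location
    return [v for v, _ in sorted(buf, key=lambda pair: pair[1])]


print(k_smallest_sorted_buffer([4, 6, 2, 3, 1, 5], 3))  # [2, 3, 1]
```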
O(1) space:
If we're allowed to overwrite the data, the same algorithm can be used, but instead of using a separate array[k], use the first k elements of the list (and you can skip the init).
If the data has to be preserved, see my second answer with good performance for large k and O(1) space.
First find the Kth smallest number in the array.
Look at https://www.geeksforgeeks.org/kth-smallestlargest-element-unsorted-array-set-2-expected-linear-time/
The link above shows how to use randomized quickselect to find the kth smallest element in O(n) time on average.
Once you have the kth smallest element, loop through the array and print all elements that are less than or equal to it. Note that if the kth smallest value occurs more than once, this loop can print more than k elements unless the duplicates are counted.
int small={Kth smallest number in the array}
for(int i=0;i<array.length;i++){
if(array[i]<=small){
System.out.println(array[i]+ " ");
}
}
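To illustrate the select-then-filter idea including duplicate handling, here is a small Python sketch. It is not the linked implementation: `quickselect` below is a simple list-based variant that copies data (so it uses O(n) extra space), and both function names are assumptions for this example.

```python
import random


def quickselect(items, k):
    """Return the k-th smallest item (1-based, 1 <= k <= len(items)).

    Expected O(n) time with random pivots; works on copies of the data.
    """
    pool = list(items)
    while True:
        pivot = random.choice(pool)
        lows = [v for v in pool if v < pivot]
        equal = sum(1 for v in pool if v == pivot)
        if k <= len(lows):
            pool = lows
        elif k <= len(lows) + equal:
            return pivot
        else:
            k -= len(lows) + equal
            pool = [v for v in pool if v > pivot]


def k_smallest_via_select(arr, k):
    """Filter arr against the k-th smallest value, emitting exactly k items."""
    if k <= 0:
        return []
    if k >= len(arr):
        return list(arr)
    kth = quickselect(arr, k)
    quota = k - sum(1 for v in arr if v < kth)  # copies of kth we may emit
    out = []
    for v in arr:
        if v < kth:
            out.append(v)
        elif v == kth and quota > 0:
            out.append(v)
            quota -= 1
    return out


print(k_smallest_via_select([4, 2, 6, 1, 5], 3))  # [4, 2, 1]
```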
A baseline (complexity at most 3n-2 for k=3):
find the min M1 from the end of the list and its position P1 (store it in out[2])
redo it from P1 to find M2 at P2 (store it in out[1])
redo it from P2 to find M3 (store it in out[0])
It can undoubtedly be improved.
Solution with O(1) space and large k (for example 100,000) with only a few passes through the list.
In my first answer I presented a single pass solution using O(k) space with an option for single pass O(1) space if we are allowed to overwrite the data.
For data that cannot be overwritten, ciamej provided a O(1) solution requiring up to k passes through the data, which works great.
However, for large lists (n) and large k we may want a faster solution. For example, with n=100,000,000 (distinct values) and k=100,000 we would have to check 10 trillion items with a branch on each item + an extra pass to get those items.
To reduce the passes over n we can create a small histogram of ranges. This requires a small storage space for the histogram, but since O(1) means constant space (i.e. not depending on n or k) I think we're allowed to do that. That space could be as small as an array of 2 * uint32. Histogram size should be a power of two, which allows us to use bit masking.
To keep the following example small and simple, we'll use a list containing 16-bit positive integers and a histogram of uint32[256] - but it will work with uint32[2] as well.
First, find the k-th smallest number - only 2 passes required:
uint32 hist[256];
First pass: group (count) by multiples of 256 - no branching besides the loop
loop:
hist[(arr[i] & 0xff00) >> 8]++; // parentheses needed: >> binds tighter than &
Now we have a count for each range and can calculate which bucket our k is in.
Save the total count up to that bucket and reset the histogram.
Second pass: fill the histogram again,
now masking the lower 8 bits and only for the numbers belonging in that range.
The range check can also be done with a mask
After this last pass, all values represented in the histogram are unique
and we can easily calculate where our k-th number is.
If the count in that slot (which represents our max value after restoring
with the previous mask) is higher than one, we'll have to remember that
when printing out the numbers.
This is explained in ciamej's post, so I won't repeat it here.
---
With hist[16] and a list of 32-bit integers we would need 8 passes (4 bits per pass); with hist[4] it would be 16 passes (2 bits per pass).
The algorithm can easily be adjusted for signed integers.
Example:
k = 7
uint32_t hist[256]; // can be as small as hist[2]
uint16_t arr[]:
88
258
4
524
620
45
440
112
380
580
88
178
Fill histogram with:
hist[(arr[i] & 0xff00) >> 8]++;
hist count
0 (0-255) 6
1 (256-511) 3 -> k
2 (512-767) 3
...
k is in hist[1] -> (256-511)
Clear histogram and fill with range (256-511):
Fill histogram with:
if ((arr[i] & 0xff00) == 0x0100) // parentheses needed: == binds tighter than &
hist[arr[i] & 0xff]++;
Numbers in this range are:
258 & 0xff = 2
440 & 0xff = 184
380 & 0xff = 124
hist count
0 0
1 0
2 1 -> k
... 0
124 1
... 0
184 1
... 0
k - 6 (first pass) = 1
k is in hist[2], which is 2 + 256 = 258
Loop through arr[] to display the numbers <= 258 in preserved order.
Take care of possible duplicates of the highest number (here that would mean hist[2] > 1).
We can easily calculate how many of those we have to print.
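The two passes of this example can be sketched in Python as below. This is only an illustration for 16-bit non-negative values with a 256-entry histogram; the function name `kth_smallest_16bit` and its return convention (the value plus the number of copies of it to print) are assumptions of this sketch.

```python
def kth_smallest_16bit(arr, k):
    """Two-pass histogram selection for values in 0..65535 (k is 1-based).

    Returns (value, copies): the k-th smallest value and how many copies of
    it belong to the k smallest items.
    """
    if not 1 <= k <= len(arr):
        raise ValueError("k out of range")
    # First pass: count by the high byte.
    hist = [0] * 256
    for v in arr:
        hist[(v & 0xFF00) >> 8] += 1
    below = 0  # number of items in buckets below the one containing k
    for high in range(256):
        if below + hist[high] >= k:
            break
        below += hist[high]
    # Second pass: count by the low byte, only for items in that bucket.
    hist = [0] * 256
    for v in arr:
        if (v & 0xFF00) == (high << 8):
            hist[v & 0xFF] += 1
    for low in range(256):
        if below + hist[low] >= k:
            return (high << 8) | low, k - below
        below += hist[low]
    raise AssertionError("unreachable for a valid k")


data = [88, 258, 4, 524, 620, 45, 440, 112, 380, 580, 88, 178]
print(kth_smallest_16bit(data, 7))  # (258, 1)
```

With the returned pair, the final loop prints every item smaller than the value and at most `copies` items equal to it.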
Further optimization:
If we can expect k to be in the lower ranges, we can even optimize this further by using the log2 values instead of fixed ranges:
There is a single CPU instruction to count the leading zero bits (or one bits)
so we don't have to call a standard log() function
but can call an intrinsic function instead.
This would require hist[65] for a list with 64-bit (positive) integers.
We would then have something like:
hist[ 64 - n_leading_zero_bits ]++;
This way the ranges we have to use in the following passes would be smaller.
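As a rough illustration of the log2 bucketing, Python's built-in `int.bit_length()` gives the same number as `64 - n_leading_zero_bits` for non-negative values below 2^64. The helper names `log2_histogram` and `bucket_for_rank` are made up for this sketch.

```python
def log2_histogram(values, bits=64):
    """Count values by bit length: bucket 0 holds 0, bucket b holds
    the range 2**(b-1) .. 2**b - 1."""
    hist = [0] * (bits + 1)
    for v in values:
        hist[v.bit_length()] += 1
    return hist


def bucket_for_rank(hist, k):
    """Return (bucket, items_below) for the bucket containing the k-th smallest."""
    seen = 0
    for bucket, count in enumerate(hist):
        if seen + count >= k:
            return bucket, seen
        seen += count
    raise ValueError("k exceeds the number of values")


data = [88, 258, 4, 524, 620, 45, 440, 112, 380, 580, 88, 178]
hist = log2_histogram(data)
print(bucket_for_rank(hist, 7))  # (9, 6): the 7th smallest lies in 256..511
```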
You are given all subset sums of an array. You are then supposed to recover the original array from the subset sums provided.
Every element in the original array is guaranteed to be non-negative and less than 10^5. There are no more than 20 elements in the original array. The original array is also sorted. The input is guaranteed to be valid.
Example 1
If the subset sums provided are this:
0 1 5 6 6 7 11 12
We can quickly deduce that the size of the original array is 3 since there are 8 (2^3) subsets. The output (i.e original array) for the above input is this:
1 5 6
Example 2
Input:
0 1 1 2 8 9 9 10
Output:
1 1 8
What I Tried
Since all elements are guaranteed to be non-negative, the largest integer in the input must be the total of the array. However, I am not sure as to how do I proceed from there. By logic, I thought that the next (2^2 - 1) largest subset sums must include all except one element from the array.
However, the above logic does not work when the original array is this:
1 1 8
That's why I am stuck and not sure how to proceed.
Say S is the subset sum array and A is the original array. I'm assuming S is sorted.
|A| = log2(|S|)
S[0] = 0
S[1] = A[0]
S[2] = A[1]
S[3] = EITHER A[2] OR A[0] + A[1].
In general, S[i] for i >= 3 is either an element of A or a combination of the elements of A that you've already encountered. When processing S, skip once per combination of known elements of A that generate a given number, add any remaining numbers to A. Stop when A gets to the right size.
E.g., if A=[1,2,7,8,9] then S will include [1,2,1+2=3,...,1+8=9, 2+7=9,9,...]. When processing S we skip over two 9s because of 1+8 and 2+7, then see a third 9 which we know must belong to A.
E.g., if S=[0,1,1,2,8,9,9,10] then we know A has 3 elements, that the first 2 elements of A are [1,1], when we get to 2 we skip it because 1+1=2, we append 8 and we're done because we have 3 elements.
Here's an easy algorithm that doesn't require finding which subset sums to a given number.
S ← input sequence
X ← empty sequence
While S has more than one element (so that zero elements of the array are recovered too):
    d ← second smallest element of S (the smallest one is always zero)
    Insert d in X
    N ← empty sequence
    While S is not empty:
        z ← smallest element of S
        Remove both z and z+d from S (if S does not contain z+d, it's an error; remove only one instance of each of z and z+d if there are several).
        Insert z in N.
    S ← N
Output X.
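A possible Python rendering of this pairing procedure, given as a sketch rather than a reference implementation. It walks the sorted sums once per recovered element and uses a `collections.Counter` to remember which partners z+d are still owed; the name `recover_from_subset_sums` is hypothetical.

```python
from collections import Counter


def recover_from_subset_sums(sums):
    """Recover the sorted original array from the multiset of its subset sums."""
    remaining = sorted(sums)
    recovered = []
    while len(remaining) > 1:
        d = remaining[1]        # second smallest: the smallest unrecovered element
        recovered.append(d)
        owed = Counter()        # partners z + d that still have to be removed
        kept = []               # the sums that do not use d (sequence N)
        for z in remaining:     # ascending order
            if owed[z] > 0:
                owed[z] -= 1    # z is the partner of an earlier kept sum
            else:
                kept.append(z)
                owed[z + d] += 1
        if any(owed.values()):
            raise ValueError("input is not a valid list of subset sums")
        remaining = kept
    return recovered


print(recover_from_subset_sums([0, 1, 5, 6, 6, 7, 11, 12]))  # [1, 5, 6]
print(recover_from_subset_sums([0, 1, 1, 2, 8, 9, 9, 10]))   # [1, 1, 8]
```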
I revisited this question a few years later and finally managed to solve it! The approach that I've used to tackle this problem is the same as what Dave had devised earlier. Dave gave a pretty concrete explanation, so I'll just add some details and append my commented C++ code so that it's a bit clearer.
Excluding the empty set, the two smallest elements in S have to be the two smallest elements in A. This is because every element is guaranteed to be non-negative. Knowing the values of A[0] and A[1], we have something tangible to work with and build on bottom-up.
Following that, any new element in S can either be a sum of the previous elements we have confirmed to be in A, or it can be an entirely new element of A (i.e. S[3] = A[0] + A[1] or S[3] = A[2]). To keep track of this, we can use a frequency table such as an unordered_map<int, int> in C++. We then repeat this process for S[4], S[5]... to continue filling up A.
To prune our search space, we can stop the moment the size of A corresponds to the size of S (i.e. |A| = log(|S|)/log(2)). This helps us cut unnecessary computation and runtime drastically.
#include <bits/stdc++.h>
using namespace std;
typedef vector<int> vi;
int main () {
int n; cin>>n;
vi S, A, sums;
unordered_map<int, int> freq;
for (int i=0;i<(1 << n);i++) { // exact integer 2^n, avoids rounding of pow()
int a; cin>>a;
S.push_back(a);
}
sort(S.begin(), S.end());
// edge cases
A.push_back(S[1]);
if (n == 1) {for (auto v : A) cout << v << "\n"; return 0;}
A.push_back(S[2]);
if (n == 2) {for (auto v : A) cout << v << "\n"; return 0;}
sums.push_back(0); sums.push_back(S[1]); sums.push_back(S[2]);
sums.push_back(S[1] + S[2]);
freq[S[1] + S[2]]++; // IMPT: we only need frequency of composite elements
for (int i=3; i < S.size(); i++) {
if (A.size() == n) break; // IMPT: prune the search space
// has to be a new element in A
if (freq[S[i]] == 0) {
// compute the new subset sums with the addition of a new element
vi newsums = sums;
for (int j=0;j<sums.size();j++) {
int y = sums[j] + S[i];
newsums.push_back(y);
if (j != 0) freq[y]++; // IMPT: coz we only need frequency of composite elements
}
// update A and subset sums
sums = newsums;
A.push_back(S[i]);
} else {
// has to be a summation of the previous elements in A
freq[S[i]]--;
}
}
for (auto v : A) cout << v << "\n";
}
Consider the following question:
Given a 2D array of unsigned integers and a maximum length n, find a path in that matrix that is not longer than n and which maximises the sum. The output should consist of both the path and the sum.
A path consists of neighbouring integers that are either all in the same row, or in the same column, or down a diagonal in the down-right direction.
For example, consider the following matrix and a given path length limit of 3:
1 2 3 4 5
2 1 2 2 1
3 4 5* 6 5
3 3 5 10* 5
1 2 5 7 15*
The optimal path would be 5 + 10 + 15 (nodes are marked with *).
Now, upon seeing this problem, immediately a Dynamic Programming solution seems to be most appropriate here, given this problem's similarity to other problems like Min Cost Path or Maximum Sum Rectangular Submatrix. The issue is that in order to correctly solve this problem, you need to start building up the paths from every integer (node) in the matrix and not just start the path from the top left and end on the bottom right.
I was initially thinking of an approach similar to that of the solution for Maximum Sum Rectangular Submatrix in which I could store each possible path from every node (with path length less than n, only going right/down), but the only way I can envision that approach is by making recursive calls for down and right from each node which would seem to defeat the purpose of DP. Also, I need to be able to store the max path.
Another possible solution I was thinking about was somehow adapting a longest path search and running it from each int in the graph where each int is like an edge weight.
What would be the most efficient way to find the max path?
The challenge here is to avoid summing the same nodes more than once. For that you could apply the following algorithm:
Algorithm
1. For each of the 3 directions (down, down+right, right) perform steps 2 and 3:
2. Determine the number of lines that exist in this direction. For the downward direction, this is the number of columns. For the rightward direction, this is the number of rows. For the diagonal direction, this is the number of diagonal lines, i.e. the sum of the number of rows and columns minus 1.
3. For each line do:
   - Determine the first node on that line (call it the "head"), and also set the "tail" to that same node. These two references refer to the end points of the "current" path. Also set both the sum and path-length to zero.
   - For each head node on the current line perform the following bullet points:
     - Add the head node's value to the sum and increase the path length
     - If the path length is larger than the allowed maximum, subtract the tail's value from the sum, and set the tail to the node that follows it on the current line
     - Whenever the sum is greater than the greatest sum found so far, remember it together with the path's location.
     - Set the head to the node that follows it on the current line
4. At the end return the greatest sum and the path that generated this sum.
Code
Here is an implementation in basic JavaScript:
function maxPathSum(matrix, maxLen) {
var row, rows, col, cols, line, lines, dir, dirs, len,
headRow, headCol, tailRow, tailCol, sum, maxSum, path;
rows = matrix.length;
cols = matrix[0].length;
maxSum = -1;
dirs = 3; // Number of directions that paths can follow
if (maxLen == 1 || cols == 1)
dirs = 1; // Only need to check downward directions
for (dir = 1; dir <= dirs; dir++) {
// Number of lines in this direction to try paths on
lines = [cols, rows, rows + cols - 1][dir-1];
for (line = 0; line < lines; line++) {
sum = 0;
len = 0;
// Set starting point depending on the direction
headRow = [0, line, line >= rows ? 0 : line][dir-1];
headCol = [line, 0, line >= rows ? line - rows + 1 : 0][dir-1];
tailRow = headRow;
tailCol = headCol;
// Traverse this line
while (headRow < rows && headCol < cols) {
// Lengthen the path at the head
sum += matrix[headRow][headCol];
len++;
if (len > maxLen) {
// Shorten the path at the tail
sum -= matrix[tailRow][tailCol];
tailRow += dir % 2;
tailCol += dir >> 1;
}
if (sum > maxSum) {
// Found a better path
maxSum = sum;
path = '(' + tailRow + ',' + tailCol + ') - '
+ '(' + headRow + ',' + headCol + ')';
}
headRow += dir % 2;
headCol += dir >> 1;
}
}
}
// Return the maximum sum and the string representation of
// the path that has this sum
return { maxSum, path };
}
// Sample input
var matrix = [
[1, 2, 3, 4, 5],
[2, 1, 2, 2, 1],
[3, 4, 5, 5, 5],
[3, 3, 5, 10, 5],
[1, 2, 5, 5, 15],
];
var best = maxPathSum(matrix, 3);
console.log(best);
Some details about the code
Be aware that row/column indexes start at 0.
The way the head and tail coordinates are incremented is based on the binary representation of the dir variable: it takes these three values (binary notation): 01, 10, 11
You can then take the first bit to indicate whether the next step in the direction is on the next column (1) or not (0), and the second bit to indicate whether it is on the next row (1) or not (0). You can depict it like this, where 00 represents the "current" node:
00 10
01 11
So we have this meaning to the values of dir:
01: walk along the column
10: walk along the row
11: walk diagonally
The code uses >>1 for extracting the first bit, and % 2 for extracting the last bit. That operation will result in a 0 or 1 in both cases, and is the value that needs to be added to either the column or the row.
The following expression creates a 1D array and takes one of its values on-the-fly:
headRow = [0, line, line >= rows ? 0 : line][dir-1];
It is short for:
switch (dir) {
case 1:
headRow = 0;
break;
case 2:
headRow = line;
break;
case 3:
if (line >= rows)
headRow = 0
else
headRow = line;
break;
}
Time and space complexity
The head will visit each node exactly once per direction. The tail will visit fewer nodes. The number of directions is constant, and the max path length value does not influence the number of head visits, so the time complexity is:
Θ(rows * columns)
There are no additional arrays used in this algorithm, just a few primitive variables. So the additional space complexity is:
Θ(1)
both of which are the best you could hope for.
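For comparison, the same head/tail scan can be sketched in Python. This is an illustrative port under assumptions, not the answer's code: the function name `max_line_path` and the tuple it returns are invented here, and the list of starting cells costs O(rows + columns) memory instead of the Θ(1) of the version above.

```python
def max_line_path(matrix, max_len):
    """Best straight path (down, right or down-right) of at most max_len cells.

    Returns (best_sum, tail_cell, head_cell) with 0-based (row, col) cells.
    """
    rows, cols = len(matrix), len(matrix[0])
    best = (-1, None, None)
    # (row step, column step) and the first cell of every line in that direction
    directions = [
        ((1, 0), [(0, c) for c in range(cols)]),
        ((0, 1), [(r, 0) for r in range(rows)]),
        ((1, 1), [(r, 0) for r in range(rows)] + [(0, c) for c in range(1, cols)]),
    ]
    for (dr, dc), starts in directions:
        for r, c in starts:
            tail_r, tail_c = r, c
            total = length = 0
            while r < rows and c < cols:
                total += matrix[r][c]          # lengthen the path at the head
                length += 1
                if length > max_len:           # shorten the path at the tail
                    total -= matrix[tail_r][tail_c]
                    tail_r, tail_c = tail_r + dr, tail_c + dc
                    length -= 1
                if total > best[0]:
                    best = (total, (tail_r, tail_c), (r, c))
                r, c = r + dr, c + dc
    return best


grid = [
    [1, 2, 3, 4, 5],
    [2, 1, 2, 2, 1],
    [3, 4, 5, 6, 5],
    [3, 3, 5, 10, 5],
    [1, 2, 5, 7, 15],
]
print(max_line_path(grid, 3))  # (30, (2, 2), (4, 4))
```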
Is it Dynamic Programming?
In a DP solution you would typically use some kind of tabulation or memoization, possibly in the form of a matrix, where each sub-result found for a particular node is input for determining the result for neighbouring nodes.
Such solutions could need Θ(rows*columns) extra space. But this problem can be solved without such (extensive) space usage. When looking at one line at a time (a row, a column or a diagonal), the algorithm has some similarities with Kadane's algorithm:
One difference is that here the choice to extend or shorten the path/subarray is not dependent on the matrix data itself, but on the given maximum length. This is also related to the fact that here all values are guaranteed to be non-negative, while Kadane's algorithm is suitable for signed numbers.
Just like with Kadane's algorithm the best solution so far is maintained in a separate variable.
Another difference is that here we need to look in three directions. But that just means repeating the same algorithm in those three directions, while carrying over the best solution found so far.
This is a very basic use of Dynamic Programming, since you don't need the tabulation or memoization techniques here. We only keep the best results in the variables sum and maxSum. That cannot be viewed as tabulation or memoization, which typically keep track of several competing results that must be compared at some time. See this interesting answer on the subject.
Use F[i][j][k] as the max path sum where the path has length k and ends at position (i, j).
F[i][j][k] can be computed from F[i-1][j][k-1] and F[i][j-1][k-1].
The answer would be the maximum value of F.
To retrieve the max path, use another table G[i][j][k] to store the last step of F[i][j][k], i.e. it comes from (i-1,j) or (i,j-1).
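A hedged Python sketch of this recurrence, assuming (as this answer does) that a path may turn and moves only down or right. To keep it short it stores just two layers of F instead of the full table and omits the G table for path retrieval; the name `max_turning_path` is hypothetical.

```python
def max_turning_path(matrix, max_len):
    """Best sum of a path of at most max_len cells that moves down or right."""
    rows, cols = len(matrix), len(matrix[0])
    # prev[i][j]: best sum of a path of exactly k-1 cells ending at (i, j),
    # or None when no such path exists
    prev = [row[:] for row in matrix]  # layer k = 1
    best = max(max(row) for row in matrix)
    for k in range(2, max_len + 1):
        cur = [[None] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                options = []
                if i > 0 and prev[i - 1][j] is not None:
                    options.append(prev[i - 1][j])  # arrived from above
                if j > 0 and prev[i][j - 1] is not None:
                    options.append(prev[i][j - 1])  # arrived from the left
                if options:
                    cur[i][j] = max(options) + matrix[i][j]
                    best = max(best, cur[i][j])
        prev = cur
    return best


print(max_turning_path([[1, 2], [3, 4]], 3))  # 8, via 1 -> 3 -> 4
```

Because turning is allowed here, the result can exceed the best straight-line path of the original question.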
The constraints are that the path can only be created by going down or to the right in the matrix.
Solution complexity O(N * M * L) where:
N: number of rows
M: number of columns
L: max length of the path
int solve(int x, int y, int l) {
    if (x > N || y > M) { return -INF; } // stepped outside the matrix
    if (l == 1) { return matrix[x][y]; } // a path of length 1 is the cell itself
    if (dp[x][y][l] != -INF) { return dp[x][y][l]; } // if cached before, return the answer
    int option1 = solve(x + 1, y, l - 1); // take a step down
    int option2 = solve(x, y + 1, l - 1); // take a step right
    maxPath[x][y][l] = (option1 > option2) ? DOWN : RIGHT; // to trace the path
    return dp[x][y][l] = max(option1, option2) + matrix[x][y];
}
example: solve(3,3,3): max path sum starting from (3,3) with length 3 ( 2 steps)
I have a question and I tried to think over it again and again... but got nothing so posting the question here. Maybe I could get some view-point of others, to try and make it work...
The question is: we are given a SORTED array, which consists of a collection of values occurring an EVEN number of times, except one, which occurs ODD number of times. We need to find the solution in log n time.
It is easy to find the solution in O(n) time, but it looks pretty tricky to perform in log n time.
Theorem: Every deterministic algorithm for this problem probes Ω(log² n) memory locations in the worst case.
Proof (completely rewritten in a more formal style):
Let k > 0 be an odd integer and let n = k². We describe an adversary that forces (log₂ (k + 1))² = Ω(log² n) probes.
We call the maximal subsequences of identical elements groups. The adversary's possible inputs consist of k length-k segments x1 x2 … xk. For each segment xj, there exists an integer bj ∈ [0, k] such that xj consists of bj copies of j - 1 followed by k - bj copies of j. Each group overlaps at most two segments, and each segment overlaps at most two groups.
Group boundaries
|   |     |   |   |
 0 0 1 1 1 2 2 3 3
|     |     |     |
Segment boundaries
Wherever there is an increase of two, we assume a double boundary by convention.
Group boundaries
|     ||      |   |
 0 0 0 2 2 2 2 3 3
Claim: The location of the jth group boundary (1 ≤ j ≤ k) is uniquely determined by the segment xj.
Proof: It's just after the ((j - 1) k + bj)th memory location, and xj uniquely determines bj. //
We say that the algorithm has observed the jth group boundary in case the results of its probes of xj uniquely determine xj. By convention, the beginning and the end of the input are always observed. It is possible for the algorithm to uniquely determine the location of a group boundary without observing it.
Group boundaries
|   X   |   |     |
 0 0 ? 1 2 2 3 3 3
|     |     |     |
Segment boundaries
Given only 0 0 ?, the algorithm cannot tell for sure whether ? is a 0 or a 1. In context, however, ? must be a 1, as otherwise there would be three odd groups, and the group boundary at X can be inferred. These inferences could be problematic for the adversary, but it turns out that they can be made only after the group boundary in question is "irrelevant".
Claim: At any given point during the algorithm's execution, consider the set of group boundaries that it has observed. Exactly one consecutive pair is at odd distance, and the odd group lies between them.
Proof: Every other consecutive pair bounds only even groups. //
Define the odd-length subsequence bounded by the special consecutive pair to be the relevant subsequence.
Claim: No group boundary in the interior of the relevant subsequence is uniquely determined. If there is at least one such boundary, then the identity of the odd group is not uniquely determined.
Proof: Without loss of generality, assume that each memory location not in the relevant subsequence has been probed and that each segment contained in the relevant subsequence has exactly one location that has not been probed. Suppose that the jth group boundary (call it B) lies in the interior of the relevant subsequence. By hypothesis, the probes to xj determine B's location up to two consecutive possibilities. We call the one at odd distance from the left observed boundary odd-left and the other odd-right. For both possibilities, we work left to right and fix the location of every remaining interior group boundary so that the group to its left is even. (We can do this because they each have two consecutive possibilities as well.) If B is at odd-left, then the group to its left is the unique odd group. If B is at odd-right, then the last group in the relevant subsequence is the unique odd group. Both are valid inputs, so the algorithm has uniquely determined neither the location of B nor the odd group. //
Example:
Observed group boundaries; relevant subsequence marked by […]
[             ]   |
 0 0 Y 1 1 Z 2 3 3
|     |     |     |
Segment boundaries
Possibility #1: Y=0, Z=2
Possibility #2: Y=1, Z=2
Possibility #3: Y=1, Z=1
As a consequence of this claim, the algorithm, regardless of how it works, must narrow the relevant subsequence to one group. By definition, it therefore must observe some group boundaries. The adversary now has the simple task of keeping open as many possibilities as it can.
At any given point during the algorithm's execution, the adversary is internally committed to one possibility for each memory location outside of the relevant subsequence. At the beginning, the relevant subsequence is the entire input, so there are no initial commitments. Whenever the algorithm probes an uncommitted location of xj, the adversary must commit to one of two values: j - 1, or j. If it can avoid letting the jth boundary be observed, it chooses a value that leaves at least half of the remaining possibilities (with respect to observation). Otherwise, it chooses so as to keep at least half of the groups in the relevant interval and commits values for the others.
In this way, the adversary forces the algorithm to observe at least log₂ (k + 1) group boundaries, and in observing the jth group boundary, the algorithm is forced to make at least log₂ (k + 1) probes.
Extensions:
This result extends straightforwardly to randomized algorithms by randomizing the input, replacing "at best halved" (from the algorithm's point of view) with "at best halved in expectation", and applying standard concentration inequalities.
It also extends to the case where no group can be larger than s copies; in this case the lower bound is Ω(log n log s).
A sorted array suggests a binary search. We have to redefine equality and comparison. Equality simply means the group has an odd number of elements. We can do the comparison by observing the index of the first or last element of the group. The first element of a group will be at an even index (0-based) before the odd group, and at an odd index after the odd group. We can find the first and last elements of a group using binary search. The total cost is O((log N)²).
PROOF OF O((log N)²)
T(2) = 1 //to make the summation nice
T(N) = log(N) + T(N/2) //log(N) is finding the first/last elements
For some N=2^k,
T(2^k) = (log 2^k) + T(2^(k-1))
= (log 2^k) + (log 2^(k-1)) + T(2^(k-2))
= (log 2^k) + (log 2^(k-1)) + (log 2^(k-2)) + ... + (log 2^2) + 1
= k + (k-1) + (k-2) + ... + 1
= k(k+1)/2
= (k² + k)/2
= (log(N)² + log(N))/ 2
= O(log(N)²)
Look at the middle element of the array. With a couple of appropriate binary searches, you can find its first and last appearance in the array. E.g., if the middle element is 'a', you need to find i and j as shown below:
[* * * * a a a a * * *]
         ^     ^
         |     |
         |     |
         i     j
Is j - i an even number? You are done! Otherwise (and this is the key here), the question to ask is: is i an even or an odd number? Do you see what this piece of knowledge implies? Then the rest is easy.
This answer is in support of the answer posted by "throwawayacct". He deserves the bounty. I spent some time on this question and I'm totally convinced that his proof is correct that you need Ω(log(n)^2) queries to find the number that occurs an odd number of times. I'm convinced because I ended up recreating the exact same argument after only skimming his solution.
In the solution, an adversary creates an input to make life hard for the algorithm, but also simple for a human analyzer. The input consists of k pages that each have k entries. The total number of entries is n = k^2, and it is important that O(log(k)) = O(log(n)) and Ω(log(k)) = Ω(log(n)). To make the input, the adversary makes a string of length k of the form 00...011...1, with the transition in an arbitrary position. Then each symbol in the string is expanded into a page of length k of the form aa...abb...b, where on the ith page, a=i and b=i+1. The transition on each page is also in an arbitrary position, except that the parity agrees with the symbol that the page was expanded from.
It is important to understand the "adversary method" of analyzing an algorithm's worst case. The adversary answers queries about the algorithm's input, without committing to future answers. The answers have to be consistent, and the game is over when the adversary has been pinned down enough for the algorithm to reach a conclusion.
With that background, here are some observations:
1) If you want to learn the parity of a transition in a page by making queries in that page, you have to learn the exact position of the transition and you need Ω(log(k)) queries. Any collection of queries restricts the transition point to an interval, and any interval of length more than 1 has both parities. The most efficient search for the transition in that page is a binary search.
2) The most subtle and most important point: There are two ways to determine the parity of a transition inside a specific page. You can either make enough queries in that page to find the transition, or you can infer the parity if you find the same parity in both an earlier and a later page. There is no escape from this either-or. Any set of queries restricts the transition point in each page to some interval. The only restriction on parities comes from intervals of length 1. Otherwise the transition points are free to wiggle to have any consistent parities.
3) In the adversary method, there are no lucky strikes. For instance, suppose that your first query in some page is toward one end instead of in the middle. Since the adversary hasn't committed to an answer, he's free to put the transition on the long side.
4) The end result is that you are forced to directly probe the parities in Ω(log(k)) pages, and the work for each of these subproblems is also Ω(log(k)).
5) Things are not much better with random choices than with adversarial choices. The math is more complicated, because now you can get partial statistical information, rather than a strict yes you know a parity or no you don't know it. But it makes little difference. For instance, you can give each page length k^2, so that with high probability, the first log(k) queries in each page tell you almost nothing about the parity in that page. The adversary can make random choices at the beginning and it still works.
Start at the middle of the array and walk backward until you get to a value that's different from the one at the center. Check whether the number above that boundary is at an odd or even index. If it's odd, then the number occurring an odd number of times is to the left, so repeat your search between the beginning and the boundary you found. If it's even, then the number occurring an odd number of times must be later in the array, so repeat the search in the right half.
As stated, this has both a logarithmic and a linear component. If you want to keep the whole thing logarithmic, use a binary search to find the boundary instead of walking backward through the array. Unless you expect many repetitions of the same numbers, though, the binary search may not be worthwhile.
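The walking variant described above might be sketched in C as follows. This is only a sketch under stated assumptions: the array has odd length, equal values are adjacent, and exactly one value occurs an odd number of times. The name find_odd_walk is hypothetical. Unlike the description, the sketch also walks forward to the end of the middle run so that it can tell whether that run itself is the odd one; that extra step is an assumption of this sketch, not something the answer spells out.

```c
#include <stddef.h>

/* Returns the index of an element inside the odd-length run, or n if the
   input does not contain one. The search range [lo, hi) always starts and
   ends on run boundaries and holds an odd number of elements. */
size_t find_odd_walk(const int *a, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        /* Walk backward to the first element of the run containing mid. */
        size_t s = mid;
        while (s > lo && a[s - 1] == a[mid])
            s--;

        if ((s - lo) % 2 == 1) {
            /* An odd number of elements precedes the run:
               the odd run is to the left. */
            hi = s;
            continue;
        }

        /* Walk forward to one past the last element of the run. */
        size_t e = mid + 1;
        while (e < hi && a[e] == a[mid])
            e++;

        if ((e - s) % 2 == 1)
            return mid;   /* the middle run itself has odd length */
        lo = e;           /* otherwise the odd run is to the right */
    }
    return n;
}
```

Each round halves the range but may spend time proportional to the length of the middle run, which is the linear component mentioned above.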
I have an algorithm which runs in O(log(N/C)*log(K)), where K is the length of the longest same-value range, and C is the length of the range being searched for.
The main difference of this algorithm from most posted before is that it takes advantage of the case where all same-value ranges are short. It finds boundaries not by binary-searching the entire array, but by first quickly finding a rough estimate by jumping back by 1, 2, 4, 8, ... (log(K) iterations) steps, and then binary-searching the resulting range (log(K) again).
The algorithm is as follows (written in C#):
// Finds the start of the range of equal numbers containing the index "index",
// which is assumed to be inside the array
//
// Complexity is O(log(K)) with K being the length of range
static int findRangeStart (int[] arr, int index)
{
    int candidate = index;
    int value = arr[index];
    int step = 1;
    // find the boundary for binary search:
    while(candidate>=0 && arr[candidate] == value)
    {
        candidate -= step;
        step *= 2;
    }
    // binary search; "a" is always before the range (-1 if the range may
    // start at index 0, so arr[0] is never wrongly excluded), "b" is inside it:
    int a = Math.Max(-1,candidate);
    int b = candidate+step/2;
    while(a+1!=b)
    {
        int c = (a+b)/2;
        if(arr[c] == value)
            b = c;
        else
            a = c;
    }
    return b;
}
// Finds the index after the only "odd" range of equal numbers in the array.
// The result should be in the range (start; end]
// The "end" is considered to always be the end of some equal number range.
static int search(int[] arr, int start, int end)
{
    if(arr[start] == arr[end-1])
        return end;
    int middle = (start+end)/2;
    int rangeStart = findRangeStart(arr,middle);
    if((rangeStart & 1) == 0)
        return search(arr, middle, end);
    return search(arr, start, rangeStart);
}
// Finds the index after the only "odd" range of equal numbers in the array
static int search(int[] arr)
{
    return search(arr, 0, arr.Length);
}
Take the middle element e. Use binary search to find the first and last occurrence. O(log(n))
If the number of occurrences is odd, return e.
Otherwise, recurse onto the side that has an odd number of elements [....]eeee[....]
Runtime will be log(n) + log(n/2) + log(n/4) + ... = O(log(n)^2).
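The steps above can be sketched in C roughly as follows. This is a minimal sketch, assuming the array has odd length, equal values are adjacent, and exactly one value occurs an odd number of times. The names first_of_run, last_of_run and find_odd_run are hypothetical helpers introduced here for illustration.

```c
#include <stddef.h>

/* First index in [lo, hi] that holds the same value as a[hi]. */
static size_t first_of_run(const int *a, size_t lo, size_t hi)
{
    int v = a[hi];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == v)
            hi = mid;
        else
            lo = mid + 1;
    }
    return hi;
}

/* Last index in [lo, hi] that holds the same value as a[lo]. */
static size_t last_of_run(const int *a, size_t lo, size_t hi)
{
    int v = a[lo];
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;  /* round up so the range shrinks */
        if (a[mid] == v)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Returns the value that occurs an odd number of times. */
int find_odd_run(const int *a, size_t n)
{
    size_t lo = 0, hi = n - 1;  /* inclusive range with an odd element count */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t i = first_of_run(a, lo, mid);  /* first occurrence of a[mid] */
        size_t j = last_of_run(a, mid, hi);   /* last occurrence of a[mid] */
        if ((j - i + 1) % 2 == 1)
            return a[mid];
        if ((i - lo) % 2 == 1)
            hi = i - 1;  /* the left side has an odd number of elements */
        else
            lo = j + 1;  /* otherwise the right side does */
    }
    return a[lo];
}
```

Each round costs two binary searches and at least halves the range, which matches the log(n) + log(n/2) + ... sum above.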
AHhh. There is an answer.
Do a binary search and as you search, for each value, move backwards until you find the first entry with that same value. If its index is even, it is at or before the oddball, so move to the right.
If its array index is odd, it is after the oddball, so move to the left.
In pseudocode (this is the general idea, not tested...):
private static int FindOddBall(int[] ary)
{
    int l = 0,
        r = ary.Length - 1;
    // l and r always stay even, and the oddball stays within [l, r]
    while (r > l)
    {
        int n = (l + r) / 2;
        // first index in [l, n] holding the same value as ary[n]
        int first = FindBreakIndex(ary, l, n);
        if (first % 2 != 0) // odd index we are to the right of the oddball
            r = first - 1;
        else // even index we are on or to the left of the oddball
        {
            int p = n - (n - first) % 2; // start of the pair that contains n
            if (ary[p] != ary[p + 1])
                return ary[p]; // unpaired element, so this run is the oddball
            l = p + 2;
        }
    }
    return ary[l];
}
private static int FindBreakIndex(int[] ary, int l, int n)
{
    var t = ary[n];
    var r = n; // ary[r] == t always holds
    while (l < r)
    {
        int m = (l + r) / 2;
        if (ary[m] == t)
            r = m;
        else
            l = m + 1;
    }
    return r;
}
You can use this algorithm:
int GetSpecialOne(int[] array, int length)
{
    int specialOne = array[0];
    for(int i=1; i < length; i++)
    {
        specialOne ^= array[i];
    }
    return specialOne;
}
Solved with the help of a similar question which can be found here on http://www.technicalinterviewquestions.net
We don't have any information about the distribution of lengths inside the array, or of the array as a whole, right?
So the array length might be 1, 11, 101, 1001 or something: at least 1, with no upper bound. It must contain at least 1 type of element ('number') and at most (length-1)/2 + 1 types; for total sizes of 1, 11 and 101 that is 1, 1 to 6, and 1 to 51 types, and so on.
Shall we assume every possible size to be equally probable? This would lead to an average subarray length of size/4, wouldn't it?
An array of size 5 could be divided into 1, 2 or 3 sublists.
What seems to be obvious is not that obvious, if we go into details.
An array of size 5 can be 'divided' into one sublist in just one way, with arguable right to call it 'dividing'. It's just a list of 5 elements (aaaaa). To avoid confusion let's assume the elements inside the list to be ordered characters, not numbers (a,b,c, ...).
Divided into two sublists, they might be (1, 4), (2, 3), (3, 2), (4, 1). (abbbb, aabbb, aaabb, aaaab).
Now let's look back at the claim made before: Shall the 'division' (5) be assumed to have the same probability as each of those 4 divisions into 2 sublists? Or shall we mix them together and assume every partition to be equally probable (1/5)?
Or can we calculate the solution without knowing the probability of the length of the sublists?
The clue is you're looking for log(n). That's less than n.
Stepping through the entire array, one at a time? That's n. That's not going to work.
We know the first two indexes in the array (0 and 1) should be the same number. Same with 50 and 51, if the odd number in the array is after them.
So find the middle element in the array, compare it to the element right after it. If the change in numbers happens on the wrong index, we know the odd number in the array is before it; otherwise, it's after. With one set of comparisons, we figure out which half of the array the target is in.
Keep going from there.
Use a hash table
For each element E in the input set
    if E is set in the hash table
        increment its value
    else
        set E in the hash table and initialize it to 1
For each key K in hash table
    if the value stored for K is odd
        return K
As this algorithm takes about 2n steps, it belongs to O(n)
Try this:
int getOddOccurrence(int ar[], int ar_size)
{
    int i;
    int res = 0; /* running XOR of all elements */
    for (i=0; i < ar_size; i++)
        res = res ^ ar[i];
    return res;
}
XOR cancels out every time you XOR with the same number, so 1^1=0 but 1^1^1=1; every pair cancels out, leaving only the number that occurs an odd number of times.
Assume indexing starts at 0. Binary search for the smallest even i such that x[i] != x[i+1]; your answer is x[i]. This is valid when every other number occurs exactly twice.
edit: due to public demand, here is the code
int f(int *x, int min, int max) {
    /* min and max are the first and last index of x; both must be even */
    min /= 2;
    max /= 2;
    while (min < max) {
        int i = (min + max)/2;
        if (x[2*i] == x[2*i+1])
            min = i+1; /* pair (2i, 2i+1) is intact: the answer is further right */
        else
            max = i; /* pair is broken: the answer is at 2i or further left */
    }
    return x[2*min];
}