Median of 5 sorted arrays

I am trying to find the solution for the median of 5 sorted arrays. This was an interview question.
The solution I could think of was to merge the 5 arrays and then find the median [O(l+m+n+o+p)].
I know that for 2 sorted arrays of the same size we can do it in log(2n) [by comparing the medians of both arrays, throwing out one half of each array, and repeating the process]. Finding the median of a sorted array is constant time, so I think this is not log(n)? What is the time complexity of this?
1] Is there a similar solution for 5 arrays? What if the arrays are of the same size; is there a better solution then?
2] I assume, since this was asked for 5, that there is some solution for N sorted arrays?
Thanks for any pointers.
Some clarifications/questions I asked back to the interviewer:
Are the arrays of the same length?
=> No
I guess there would be an overlap in the values of the arrays?
=> Yes
As an exercise, I think the logic for 2 arrays doesn't extend. Here is a try:
Applying the above logic of 2 arrays to, say, 3 arrays:
[3,7,9] [4,8,15] [2,3,9] ... medians 7,8,3
throw elements [3,7,9] [4,8] [3,9] ... medians 7,6,6
throw elements [3,7] [8] [9] ... medians 5,8,9
throw elements [7] [8] [9] ... median = 8 ... This doesn't seem to be correct?
The merge of the sorted elements => [2,3,4,7,8,9,15] => expected median = 7

(This is a generalization of your idea for two arrays.)
If you start by looking at the five medians of the five arrays, obviously the overall median must be between the smallest and the largest of the five medians.
Proof goes something like this: If a is the min of the medians, and b is the max of the medians, then each array has less than half of its elements less than a and less than half of its elements greater than b. Result follows.
So in the array containing a, throw away numbers less than a; in the array containing b, throw away numbers greater than b... But only throw away the same number of elements from both arrays.
That is, if a is j elements from the start of its array, and b is k elements from the end of its array, you throw away the first min(j,k) elements from a's array and the last min(j,k) elements from b's array.
Iterate until you are down to 1 or 2 elements total.
Each of these operations (i.e., finding median of a sorted array and throwing away k elements from the start or end of an array) is constant time. So each iteration is constant time.
Each iteration throws away (more than) half the elements from at least one array, and you can only do that log(n) times for each of the five arrays... So the overall algorithm is log(n).
[Update]
As Himadri Choudhury points out in the comments, my solution is incomplete; there are a lot of details and corner cases to worry about. So, to flesh things out a bit...
For each of the five arrays R, define its "lower median" as R[n/2-1] and its "upper median" as R[n/2], where n is the number of elements in the array (and arrays are indexed from 0, and division by 2 rounds down).
Let "a" be the smallest of the lower medians, and "b" be the largest of the upper medians. If there are multiple arrays with the smallest lower median and/or multiple arrays with the largest upper median, choose a and b from different arrays (this is one of those corner cases).
Now, borrowing Himadri's suggestion: Erase all elements up to and including a from its array, and all elements down to and including b from its array, taking care to remove the same number of elements from both arrays. Note that a and b could be in the same array; but if so, they could not have the same value, because otherwise we would have been able to choose one of them from a different array. So it is OK if this step winds up throwing away elements from the start and end of the same array.
Iterate as long as you have three or more arrays. But once you are down to just one or two arrays, you have to change your strategy to be exclusive instead of inclusive; you only erase up to but not including a and down to but not including b. Continue like this as long as both of the remaining one or two arrays have at least three elements (guaranteeing you make progress).
Finally, you will reduce to a few cases, the trickiest of which is two arrays remaining, one of which has one or two elements. Now, if I asked you: "Given a sorted array plus one or two additional elements, find the median of all elements", I think you can do that in constant time. (Again, there are a bunch of details to hammer out, but the basic idea is that adding one or two elements to an array does not "push the median around" very much.)
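As a sanity check, here is a hedged Python sketch of this procedure (my own rendering, not the answer's exact algorithm): it trims equal counts of elements strictly below a and strictly above b, which provably cannot move the overall median, and it simply falls back to merging the leftovers once no safe trim remains, instead of implementing all the corner cases above.
import heapq
from bisect import bisect_left, bisect_right

def median_of_sorted_arrays(arrays):
    # sketch: trim symmetrically around the extreme medians, then merge the rest
    arrays = [list(arr) for arr in arrays if arr]
    while True:
        lows = [(arr[(len(arr) - 1) // 2], i) for i, arr in enumerate(arrays)]
        highs = [(arr[len(arr) // 2], i) for i, arr in enumerate(arrays)]
        a, ai = min(lows)    # smallest lower median
        b, bi = max(highs)   # largest upper median
        j = bisect_left(arrays[ai], a)                     # elements strictly below a
        k = len(arrays[bi]) - bisect_right(arrays[bi], b)  # elements strictly above b
        t = min(j, k)
        if t == 0:
            break            # no safe symmetric trim left; fall back to merging
        del arrays[ai][:t]
        del arrays[bi][len(arrays[bi]) - t:]
    merged = list(heapq.merge(*arrays))
    n = len(merged)
    return merged[n // 2] if n % 2 else (merged[n // 2 - 1] + merged[n // 2]) / 2
On the question's three-array example [3,7,9], [4,8,15], [2,3,9], this returns 7, matching the expected merged median.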

It should be pretty straightforward to apply the same idea to 5 arrays.
First, convert the question to a more general one: finding the Kth element in N sorted arrays.
Find the (K/N)th element in each sorted array with binary search, say K1, K2 ... KN.
Kmin = min(K1 ... KN), Kmax = max(K1 ... KN)
Throw away all elements less than Kmin or larger than Kmax; say X of the thrown-away elements were below Kmin.
Now repeat the process by finding the (K - X)th element in the sorted arrays restricted to the remaining elements.

You don't need to do a complete merge of the 5 arrays. You can merge-sort until you have (l+m+n+o+p)/2 elements; then you have the median value.
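For instance (a sketch using Python's standard library, not the answerer's code), heapq.merge produces the merged stream lazily, so you can stop as soon as the middle is reached; this is still linear overall, it just roughly halves the work:
import heapq
from itertools import islice

def median_by_partial_merge(arrays):
    total = sum(len(a) for a in arrays)
    # consume the merged stream only up to the middle element
    head = list(islice(heapq.merge(*arrays), total // 2 + 1))
    return head[-1] if total % 2 else (head[-2] + head[-1]) / 2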

Finding the kth element in a list of sorted lists can be done by binary search.
from bisect import bisect_left
from bisect import bisect_right

def kthOfPiles(givenPiles, k, count):
    '''
    Perform binary search for kth element in multiple sorted lists

    parameters
    ==========
    givenPiles are list of sorted list
    count is the total number of elements
    k is the target index in range [0..count-1]
    '''
    begins = [0 for pile in givenPiles]
    ends = [len(pile) for pile in givenPiles]
    #print('finding k=', k, 'count=', count)
    for pileidx, pivotpile in enumerate(givenPiles):
        while begins[pileidx] < ends[pileidx]:
            mid = (begins[pileidx] + ends[pileidx]) >> 1
            midval = pivotpile[mid]
            smaller_count = 0
            smaller_right_count = 0
            for pile in givenPiles:
                smaller_count += bisect_left(pile, midval)
                smaller_right_count += bisect_right(pile, midval)
            #print('check midval', midval, smaller_count, k, smaller_right_count)
            if smaller_count <= k and k < smaller_right_count:
                return midval
            elif smaller_count > k:
                ends[pileidx] = mid
            else:
                begins[pileidx] = mid + 1
    return -1

def medianOfPiles(givenPiles, count=None):
    '''
    Find statistical median

    Parameters:
    givenPiles are list of sorted list
    '''
    if not givenPiles:
        return -1  # cannot find median
    if count is None:
        count = 0
        for pile in givenPiles:
            count += len(pile)
    # get mid floor
    target_mid = count >> 1
    midval = kthOfPiles(givenPiles, target_mid, count)
    if 0 == (count & 1):
        midval += kthOfPiles(givenPiles, target_mid - 1, count)
        midval /= 2
    return '%.1f' % round(midval, 1)
The code above gives the correct statistical median as well.
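As a quick check against the three-array example from the question:
piles = [[3, 7, 9], [4, 8, 15], [2, 3, 9]]
print(medianOfPiles(piles))  # 7.0 (merged: [2,3,3,4,7,8,9,9,15])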
Coupling the above binary search with patience sort gives a valuable technique.
The median-of-medians algorithm for selecting a pivot is also worth mentioning; it gives an approximate value, which I guess is different from what we are asking here.

Use heapq to keep each list's minimum candidates.
Prerequisite: N sorted lists, each of length K.
O(NK lg N)
import heapq

class Solution:
    def f1(self, AS):
        # merge everything, then take the median of the merged list
        def f(A):
            n = len(A)
            m = n // 2
            if n % 2:
                return A[m]
            else:
                return (A[m - 1] + A[m]) / 2
        res = []
        q = []
        for i, A in enumerate(AS):
            q.append([A[0], i, 0])
        heapq.heapify(q)
        N, K = len(AS), len(AS[0])
        while len(res) < N * K:
            mn, i, ii = heapq.heappop(q)
            res.append(mn)
            if ii < K - 1:
                heapq.heappush(q, [AS[i][ii + 1], i, ii + 1])
        return f(res)

    def f2(self, AS):
        # same heap, but stop as soon as the middle element(s) appear
        q = []
        for i, A in enumerate(AS):
            q.append([A[0], i, 0])
        heapq.heapify(q)
        N, K = len(AS), len(AS[0])
        n = N * K
        m = n // 2
        m1 = m2 = float('-inf')
        k = 0
        while q:
            mn, i, ii = heapq.heappop(q)
            k += 1
            if k == m:          # 0-indexed element m-1: the lower middle
                m1 = mn
            elif k == m + 1:    # 0-indexed element m: the median / upper middle
                m2 = mn
                return m2 if n % 2 else (m1 + m2) / 2
            if ii < K - 1:
                heapq.heappush(q, [AS[i][ii + 1], i, ii + 1])
        return 'should not go here'

Related

Optimal way to find the number of operations required to convert all K numbers to lie in the range [L,R] (i.e. L≤x≤R)

I am solving this question, which requires some optimized technique; I can only think of the brute-force method, which requires combinatorics.
Given an array A consisting of n integers, we call an integer "good" if it lies in the range [L,R] (i.e. L ≤ x ≤ R). We need to make sure that if we pick any K integers from the array, at least one of them is a good integer.
To achieve this, in a single operation we are allowed to increase/decrease any element of the array by one.
What will be the minimum number of operations we need for a fixed k,
i.e. for each k = 1 to n?
input:
L R
1 2
A=[ 1 3 3 ]
output:
for k=1 : 2
for k=2 : 1
for k=3 : 0
For k=1, you have to convert both of the 3s into 2s to make sure that whichever of the 3 integers you select, it is good.
For k=2, one of the possible ways is to convert one of the 3s into a 2.
For k=3, no operation is needed, as 1 is a good integer.
As burnpanck has explained in his answer, to make sure that when you pick any k elements of the array at least one of them is in the range [L,R], we need at least n - k + 1 numbers of the array to be in the range [L,R].
So first, for each element, we calculate the cost to make that element valid (i.e. bring it into the range [L,R]) and store those costs in an array cost.
We notice that:
For k = 1, the minimum cost is the sum of the array cost.
For k = 2, the minimum cost is the sum of cost, minus the largest element.
For k = 3, the minimum cost is the sum of cost, minus the two largest elements.
...
So, we need a prefixSum array, whose ith position is the sum of the sorted cost array from 0 to i.
After calculating prefixSum, we can answer the result for each k in O(1).
So here is the algorithm in Java; notice the time complexity is O(n log n):
int[] cost = new int[n];
for (int i = 0; i < n; i++)
    cost[i] = // Calculate min cost for element i
Arrays.sort(cost);
int[] prefix = new int[n];
for (int i = 0; i < n; i++)
    prefix[i] = cost[i] + (i > 0 ? prefix[i - 1] : 0);
for (int i = n - 1; i >= 0; i--)
    System.out.println("Result for k = " + (n - i) + " is " + prefix[i]);
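A compact Python rendering of the same idea (my sketch, not the answerer's code; the expression max(0, L - a, a - R) is the distance from a to the interval [L,R]):
from itertools import accumulate

def min_ops_per_k(L, R, A):
    # sort per-element repair costs, prefix-sum them, and for each k
    # pay only for the cheapest n - k + 1 elements
    cost = sorted(max(0, L - a, a - R) for a in A)
    prefix = list(accumulate(cost))
    return {k: prefix[len(A) - k] for k in range(1, len(A) + 1)}

print(min_ops_per_k(1, 2, [1, 3, 3]))  # {1: 2, 2: 1, 3: 0}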
To be sure that picking k elements gives at least one valid element, you must have no more than k-1 invalid elements in your set. You therefore need to find the cheapest way to make enough elements valid. I would do this as follows: in a single pass, generate a map that counts how many elements of the set need n operations to be made valid. Then, since you clearly want to fix the elements that need the fewest operations, take the required number of elements in ascending order of required operations and sum the operation counts.
In Python:
def min_ops(L, R, A_set):
    n_ops = dict()  # map: ops required -> how many elements need that many
    for a in A_set:  # loop over all a in the set A_set
        n = max(0, max(a - R, L - a))  # the number of operations required to make a valid
        n_ops[n] = n_ops.get(n, 0) + 1  # increment the count keyed by n, starting from 0
    allret = []  # a new list to hold the result for all k
    for k in range(1, len(A_set) + 1):  # iterate over all k in the range [1,N+1) == [1,N]
        n_good_required = len(A_set) - k + 1
        ret = 0
        # iterate over all (key, value) pairs of the mapping, sorted by key;
        # the key is the number of ops required, the value the number of elements available
        for n, nel in sorted(n_ops.items()):
            if n_good_required <= 0:
                break
            ret += n * min(nel, n_good_required)
            n_good_required -= nel
        allret.append(ret)  # append the answer for this k to the result list
    return allret
As an example:
A_set = [1,3,3,6,8,5,4,7]
L,R = 4,6
For each A, we find how many operations we need to make it valid:
n = [3,1,1,0,2,0,0,1]
(i.e. 1 needs 3 steps, 3 needs one, and so on)
Then we count them:
n_ops = {
0: 3, # we already have three valid elements
1: 3, # three elements that require one op
2: 1,
3: 1, # and finally one that requires 3 ops
}
Now, for each k, we find out how many valid elements we need in the set,
e.g. for k = 4, we need at most 3 invalid in the set of 8, so we need 5 valid ones.
Thus:
ret = 0
n_good_required = 5
with n=0, we have 3 so take all of them
ret = 0
n_good_required = 2
with n=1, we have 3, but we need just two, so take those
ret = 2
we're finished

Find the median of the sum of the arrays

Two sorted arrays of length n are given and the question is to find, in O(n) time, the median of their sum array, which contains all the possible pairwise sums between every element of array A and every element of array B.
For instance: Let A[2,4,6] and B[1,3,5] be the two given arrays.
The sum array is [2+1,2+3,2+5,4+1,4+3,4+5,6+1,6+3,6+5]. Find the median of this array in O(n).
Solving the question in O(n^2) is pretty straightforward, but is there any O(n) solution to this problem?
Note: This is an interview question asked to one of my friends and the interviewer was quite sure that it can be solved in O(n) time.
The correct O(n) solution is quite complicated, and it takes a significant amount of text, code, and skill to explain and prove. More precisely, it takes 3 pages to do so convincingly, as can be seen in detail here http://www.cse.yorku.ca/~andy/pubs/X+Y.pdf (found by simonzack in the comments).
It is basically a clever divide-and-conquer algorithm that, among other things, takes advantage of the fact that in a sorted n-by-n matrix, one can find in O(n) the number of elements that are smaller/greater than a given number k. It recursively breaks the matrix down into smaller submatrices (by taking only the odd rows and columns, resulting in a submatrix with n/2 columns and n/2 rows), which, combined with the step above, results in a complexity of O(n) + O(n/2) + O(n/4) ... = O(2*n) = O(n). It is crazy!
I can't explain it better than the paper, which is why I'll explain a simpler, O(n log n) solution instead :).
O(n * log n) solution:
It's an interview! You can't get that O(n) solution in time. So hey, why not provide a solution that, although not optimal, shows you can do better than the other obvious O(n²) candidates?
I'll make use of the O(n) algorithm mentioned above, to find the amount of numbers that are smaller/greater than a given number k in a sorted n-by-n matrix. Keep in mind that we don't need an actual matrix! The Cartesian sum of two arrays of size n, as described by the OP, results in a sorted n-by-n matrix, which we can simulate by considering the elements of the array as follows:
a[3] = {1, 5, 9};
b[3] = {4, 6, 8};
//a + b:
{1+4, 1+6, 1+8,
5+4, 5+6, 5+8,
9+4, 9+6, 9+8}
Thus each row contains non-decreasing numbers, and so does each column. Now, pretend you're given a number k. We want to find in O(n) how many of the numbers in this matrix are smaller than k, and how many are greater. Clearly, if both values are less than (n²+1)/2, that means k is our median!
The algorithm is pretty simple:
int smaller_than_k(int k){
    int x = 0, j = n-1;
    for(int i = 0; i < n; ++i){
        while(j >= 0 && k <= a[i]+b[j]){
            --j;
        }
        x += j+1;
    }
    return x;
}
This basically counts how many elements fit the condition at each row. Since the rows and columns are already sorted as seen above, this will provide the correct result. And as both i and j iterate at most n times each, the algorithm is O(n) [Note that j does not get reset within the for loop]. The greater_than_k algorithm is similar.
Now, how do we choose k? That is the logn part. Binary Search! As has been mentioned in other answers/comments, the median must be a value contained within this array:
candidates[n] = {a[0]+b[n-1], a[1]+b[n-2],... a[n-1]+b[0]};.
Simply sort this array [also O(n*log n)], and run the binary search on it. Since the array is now in non-decreasing order, it is straightforward to notice that the amount of numbers smaller than each candidate[i] is also non-decreasing (a monotonic function), which makes it suitable for binary search. The largest number k = candidate[i] for which smaller_than_k(k) returns a count smaller than (n²+1)/2 is the answer, and it is obtained in log(n) iterations:
int b_search(){
    int lo = 0, hi = n, mid, n2 = (n*n+1)/2;
    while(hi-lo > 1){
        mid = (hi+lo)/2;
        if(smaller_than_k(candidate[mid]) < n2)
            lo = mid;
        else
            hi = mid;
    }
    return candidate[lo]; // the median
}
Let's say the arrays are A = {A[1] ... A[n]} and B = {B[1] ... B[n]}, and the pairwise sum array is C = {A[i] + B[j], where 1 <= i <= n, 1 <= j <= n}, which has n^2 elements, and we need to find its median.
The median of C must be an element of the array D = {A[1] + B[n], A[2] + B[n - 1], ... A[n] + B[1]}: if you fix A[i] and consider all the sums A[i] + B[j], you would see that only A[i] + B[j = n + 1 - i] (which is one of D) could be the median. That is, it may not be the median, but if it is not, then no other A[i] + B[j] for that i is the median either.
This can be proved by considering all B[j] and counting the number of values lower than and the number of values greater than A[i] + B[j] (we can do this quite accurately because the two arrays are sorted; the calculation is a bit messy though). You'd see that for A[i] + B[n + 1 - i] these two counts are the most "balanced".
The problem then reduces to finding the median of D, which has only n elements. An algorithm such as Hoare's will work.
UPDATE: this answer is wrong. The real conclusion here is that the median of C is one of D's elements, but D's median is not the same as C's median.
Doesn't this work?:
You can compute the rank of a number in linear time as long as A and B are sorted. The technique you use for computing the rank can also be used to find all things in A+B that are between some lower bound and some upper bound in time linear the size of the output plus |A|+|B|.
Randomly sample n things from A+B. Take the median, say foo. Compute the rank of foo. With constant probability, foo's rank is within n of the median's rank. Keep doing this (an expected constant number of times) until you have lower and upper bounds on the median that are within 2n of each other. (This whole process takes expected linear time, but it's obviously slow.)
All you have to do now is enumerate everything between the bounds and do a linear-time selection on a linear-sized list.
(Unrelatedly, I wouldn't excuse the interviewer for asking such an obviously crappy interview question. Stuff like this in no way indicates your ability to code.)
EDIT: You can compute the rank of a number x (the number of pairs with sum <= x) by doing something like this:
Set rank = 0, j = |B| - 1.
For i = 0 to |A| - 1 {
    While j >= 0 and A[i] + B[j] > x, j--.
    If j < 0, break.
    rank += j + 1.
}
FURTHER EDIT: Actually, the above trick only narrows down the candidate space to about n log(n) members of A+B. Then you have a general selection problem within a universe of size n log(n); you can do basically the same trick one more time and find a range of size proportional to sqrt(n) log(n) where you do selection.
Here's why: If you sample k things from an n-set and take the median, then the sample median's order is between the (1/2 - sqrt(log(n) / k))n-th and the (1/2 + sqrt(log(n) / k))n-th elements with at least constant probability. When n = |A+B|, we'll want to take k = sqrt(n), and we get a range of about sqrt(n log n) elements; that's about |A| log |A|. But then you do it again and you get a range on the order of sqrt(n) polylog(n).
You should use a selection algorithm to find the median of an unsorted list in O(n). Look at this: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
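For reference, a minimal quickselect sketch in Python (my code; the median-of-medians pivot rule from the link makes the linear bound worst-case deterministic, while this randomized version is expected linear):
import random

def quickselect(xs, k):
    # kth smallest (0-based) of an unsorted list, expected O(len(xs))
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        lo = [x for x in xs if x < pivot]
        eq = [x for x in xs if x == pivot]
        if k < len(lo):
            xs = lo
        elif k < len(lo) + len(eq):
            return pivot
        else:
            k -= len(lo) + len(eq)
            xs = [x for x in xs if x > pivot]

print(quickselect([13, 3, 12, 4], 1))  # 4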

Is there an O(n) algorithm to generate a prefix-less array for a positive integer array?

For array [4,3,5,1,2]:
the prefix of 4 is NULL, so the prefix-less of 4 is 0;
the prefix of 3 is [4], and the prefix-less of 3 is 0, because nothing in the prefix is less than 3;
the prefix of 5 is [4,3], and the prefix-less of 5 is 2, because 4 and 3 are both less than 5;
the prefix of 1 is [4,3,5], and the prefix-less of 1 is 0, because nothing in the prefix is less than 1;
the prefix of 2 is [4,3,5,1], and the prefix-less of 2 is 1, because only 1 is less than 2.
So for array [4,3,5,1,2], we get the prefix-less array [0,0,2,0,1].
Can we get an O(n) algorithm to compute the prefix-less array?
It can't be done in O(n) for the same reasons a comparison sort requires O(n log n) comparisons. The number of possible prefix-less arrays is n! so you need at least log2(n!) bits of information to identify the correct prefix-less array. log2(n!) is O(n log n), by Stirling's approximation.
Assuming that the input elements are always fixed-width integers you can use a technique based on radix sort to achieve linear time:
L is the input array
X is the list of indexes of L in focus for the current pass
n is the bit we are currently working on
Count is the number of 0 bits at bit n to the left of the current location
Y is the list of indexes of a subsequence of L for recursion
P is a zero-initialized array that is the output (the prefix-less array)
In pseudo-code...
Def PrefixLess(L, X, n)
    if (n == 0)
        return
    // accumulate contributions at bit n: an earlier element of X with a 0
    // here is smaller than a current element with a 1 here
    Count = 0
    For I in 1 to |X|
        If (L(X(I))[n] == 1)
            P(X(I)) += Count
        Else
            Count++
    // go through the subsequence with bit(n) = 1 at bit n-1
    Y = []
    For I in 1 to |X|
        If (L(X(I))[n] == 1)
            Y.append(X(I))
    PrefixLess(L, Y, n-1)
    // go through the subsequence with bit(n) = 0 at bit n-1
    Y = []
    For I in 1 to |X|
        If (L(X(I))[n] == 0)
            Y.append(X(I))
    PrefixLess(L, Y, n-1)
    return P
and then execute:
PrefixLess(L, 1..|L|, 32)
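Here is a runnable Python translation of that pseudocode (my sketch; bits are 0-indexed here and the value width is a parameter):
def prefix_less(L, bits=32):
    P = [0] * len(L)

    def rec(X, n):
        if n < 0 or len(X) <= 1:
            return
        count = 0              # 0 bits at position n seen so far in X
        ones, zeros = [], []
        for i in X:
            if (L[i] >> n) & 1:
                P[i] += count  # every earlier 0-bit element is smaller
                ones.append(i)
            else:
                count += 1
                zeros.append(i)
        rec(ones, n - 1)       # ties at bit n are resolved at lower bits
        rec(zeros, n - 1)

    rec(list(range(len(L))), bits - 1)
    return P

print(prefix_less([4, 3, 5, 1, 2]))  # [0, 0, 2, 0, 1]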
I think this should work, but double-check the details. Let's call an element of the original array a[i], and the corresponding element of the prefix-less array p[i], where i is the ith position in the respective arrays.
So, say we are at a[i] and we have already computed the value of p[i]. There are three possible cases. If a[i] == a[i+1], then p[i] == p[i+1]. If a[i] < a[i+1], then p[i+1] >= p[i] + 1. This leaves us with the case where a[i] > a[i+1]. In this situation we know that p[i+1] >= p[i].
In the naïve case, we go back through the prefix and start counting items less than a[i]. However, we can do better than that. First, recognize that the minimum value for p[i] is 0 and the maximum is i. Next look at the case of an index j, where i > j. If a[i] >= a[j], then p[i] >= p[j]. If a[i] < a[j], then p[i] <= p[j] + j . So, we can start going backwards through p updating the values for p[i]_min and p[i]_max. If p[i]_min equals p[i]_max, then we have our solution.
Doing a back of the envelope analysis of the algorithm, it has O(n) best case performance. This is the case where the list is already sorted. The worst case is where it is reversed sorted. Then the performance is O(n^2). The average performance is going to be O(k*n) where k is how much one needs to backtrack. My guess is for randomly distributed integers, k will be small.
I am also pretty sure there would be ways to optimize this algorithm for cases of partially sorted data. I would look at Timsort for some inspiration on how to do this. It uses run detection to detect partially sorted data. So the basic idea for the algorithm would be to go through the list once and look for runs of data. For ascending runs of data you are going to have the case where p[i+1] = p[i]+1. For descending runs, p[i] = p_run[0] where p_run is the first element in the run.

Finding kth smallest number from n sorted arrays

So, you have n sorted arrays (not necessarily of equal length), and you are to return the kth smallest element in the combined array (i.e. the combined array formed by merging all the n sorted arrays).
I have been trying it and its other variants for quite a while now, and till now I only feel comfortable in the case where there are two arrays of equal length, both sorted and one has to return the median of these two.
This has logarithmic time complexity.
After this I tried to generalize it to finding kth smallest among two sorted arrays. Here is the question on SO.
Even here the solution given is not obvious to me. But even if I somehow manage to convince myself of this solution, I am still curious as to how to solve the absolute general case (which is my question)
Can somebody explain to me a step-by-step solution (which, again, in my opinion should take logarithmic time, i.e. O(log(n1) + log(n2) + ... + log(nN)), where n1, n2 ... nN are the lengths of the n arrays), which starts from the more specific cases and moves on to the more general one?
I know similar questions for more specific cases are there all over the internet, but I haven't found a convincing and clear answer.
Here is a link to a question (and its answer) on SO which deals with 5 sorted arrays and finding the median of the combined array. The answer just gets too complicated for me to able to generalize it.
Even clean approaches for the more specific cases (as I mentioned during the post) are welcome.
PS: Do you think this can be further generalized to the case of unsorted arrays?
PPS: It's not a homework problem, I am just preparing for interviews.
This doesn't generalize the links, but does solve the problem:
Go through all the arrays and if any have length > k, truncate to length k (this is silly, but we'll mess with k later, so do it anyway)
Identify the largest remaining array A. If more than one, pick one.
Pick the middle element M of the largest array A.
Use a binary search on the remaining arrays to find the same element (or the largest element <= M).
Based on the indexes of the various elements, calculate the total number of elements <= M and > M. This should give you two numbers: L, the number <= M and G, the number > M
If k < L, truncate all the arrays at the split points you've found and iterate on the smaller arrays (use the bottom halves).
If k > L, truncate all the arrays at the split points you've found and iterate on the smaller arrays (use the top halves), searching for element (k - L).
When you get to the point where you only have one element per array (or 0), make a new array of size n with those data, sort, and pick the kth element.
Because you're always guaranteed to remove at least half of one array, in N iterations, you'll get rid of half the elements. That means there are N log k iterations. Each iteration is of order N log k (due to the binary searches), so the whole thing is N^2 (log k)^2 That's all, of course, worst case, based on the assumption that you only get rid of half of the largest array, not of the other arrays. In practice, I imagine the typical performance would be quite a bit better than the worst case.
It cannot be done in less than O(n) time. Proof sketch: if it could, it would have to completely skip at least one array, and obviously one array can arbitrarily change the value of the kth element.
I have a relatively simple O(n*log(n)*log(m)) where m is the length of the longest array. I'm sure it is possible to be slightly faster, but not a lot faster.
Consider the simple case where you have n arrays each of length 1. Obviously, this is isomorphic to finding the kth element in an unsorted list of length n. It is possible to find this in O(n), see Median of Medians algorithm, originally by Blum, Floyd, Pratt, Rivest and Tarjan, and no (asymptotically) faster algorithms are possible.
Now the problem is how to expand this to longer sorted arrays. Here is the algorithm: Find the median of each array. Sort the list of tuples (median, length of array / 2) by median. Walk through, keeping a running sum of the lengths, until you reach a sum greater than k. You now have a pair of medians such that you know the kth element is between them. Now, for each median, we know whether the kth is greater or less than it, so we can throw away half of each array. Repeat. Once the arrays are all one element long (or less), we use the selection algorithm.
Implementing this will reveal additional complexities and edge conditions, but nothing that increases the asymptotic complexity. Each step
Finds the medians of the arrays, O(1) each, so O(n) total
Sorts the medians O(n log n)
Walks through the sorted list O(n)
Slices the arrays O(1) each so, O(n) total
that is O(n) + O(n log n) + O(n) + O(n) = O(n log n). And we must perform this until the longest array has length 1, which will take log m steps, for a total of O(n*log(n)*log(m)).
You ask if this can be generalized to the case of unsorted arrays. Sadly, the answer is no. Consider the case where we only have one array, then the best algorithm will have to compare at least once with each element for a total of O(m). If there were a faster solution for n unsorted arrays, then we could implement selection by splitting our single array into n parts. Since we just proved selection is O(m), we are stuck.
You could look at my recent answer on the related question here. The same idea can be generalized to multiple arrays instead of 2. In each iteration, you reject the second half of the array with the largest middle element if k is less than the sum of the mid indexes of all arrays. Alternatively, you reject the first half of the array with the smallest middle element if k is greater than the sum of the mid indexes of all arrays, adjusting k. Keep doing this until all but one array have been reduced to length 0. The answer is the kth element of the last array, the one that wasn't stripped to 0 elements.
Run-time analysis:
You get rid of half of one array in each iteration, but to determine which array to reduce, you spend time linear in the number of arrays. Assuming each array is of the same length, the run time is going to be c*c*log(n), where c is the number of arrays and n is the length of each array.
There exists a generalization that solves the problem in O(N log k) time; see the question here.
Old question, but none of the answers were good enough. So I am posting the solution using sliding window technique and heap:
class Node {
    int elementIndex;
    int arrayIndex;

    public Node(int elementIndex, int arrayIndex) {
        super();
        this.elementIndex = elementIndex;
        this.arrayIndex = arrayIndex;
    }
}

public class KthSmallestInMSortedArrays {
    public int findKthSmallest(List<Integer[]> lists, int k) {
        int ans = 0;
        // note: the subtraction comparator assumes values far from integer overflow
        PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> {
            return lists.get(a.arrayIndex)[a.elementIndex] -
                   lists.get(b.arrayIndex)[b.elementIndex];
        });
        for (int i = 0; i < lists.size(); i++) {
            Integer[] arr = lists.get(i);
            if (arr != null && arr.length > 0) {
                Node n = new Node(0, i);
                pq.add(n);
            }
        }
        int count = 0;
        while (!pq.isEmpty()) {
            Node curr = pq.poll();
            ans = lists.get(curr.arrayIndex)[curr.elementIndex];
            if (++count == k) {
                break;
            }
            // re-offer only if this array still has elements left
            if (curr.elementIndex < lists.get(curr.arrayIndex).length - 1) {
                curr.elementIndex++;
                pq.offer(curr);
            }
        }
        return ans;
    }
}
The maximum number of elements that we need to access here is O(K) and there are M arrays. So the effective time complexity will be O(K*log(M)).
This would be the code. O(k*log(m))
public int findKSmallest(int[][] A, int k) {
    PriorityQueue<int[]> queue = new PriorityQueue<>(Comparator.comparingInt(x -> A[x[0]][x[1]]));
    for (int i = 0; i < A.length; i++)
        queue.offer(new int[] { i, 0 });
    int ans = 0;
    while (!queue.isEmpty() && --k >= 0) {
        int[] el = queue.poll();
        ans = A[el[0]][el[1]];
        if (el[1] < A[el[0]].length - 1) {
            el[1]++;
            queue.offer(el);
        }
    }
    return ans;
}
If k is not that huge, we can maintain a priority min-queue: seed it with the head of every sorted array, then repeatedly pop the smallest element and enqueue its successor. Once we have popped k elements, we have the first k smallest.
Maybe we can also regard the n sorted arrays as buckets and try the bucket sort method.
This could be considered the second half of a merge sort: we simply merge all the sorted lists into a single list, but only keep k elements in the combined lists from merge to merge. This has the advantage of using only O(k) space and doing somewhat better than merge sort's O(n log n) work; that is, in practice it should operate slightly faster than a full merge sort. Choosing the kth smallest from the final combined list is O(1). This kind of complexity is not so bad.
It can be done by doing a binary search in each array while calculating the number of smaller elements.
I used bisect_left and bisect_right to make it work for non-unique numbers as well:
from bisect import bisect_left
from bisect import bisect_right

def kthOfPiles(givenPiles, k, count):
    '''
    Perform binary search for kth element in multiple sorted lists

    parameters
    ==========
    givenPiles are list of sorted list
    count is the total number of elements
    k is the target index in range [0..count-1]
    '''
    begins = [0 for pile in givenPiles]
    ends = [len(pile) for pile in givenPiles]
    #print('finding k=', k, 'count=', count)
    for pileidx, pivotpile in enumerate(givenPiles):
        while begins[pileidx] < ends[pileidx]:
            mid = (begins[pileidx] + ends[pileidx]) >> 1
            midval = pivotpile[mid]
            smaller_count = 0
            smaller_right_count = 0
            for pile in givenPiles:
                smaller_count += bisect_left(pile, midval)
                smaller_right_count += bisect_right(pile, midval)
            #print('check midval', midval, smaller_count, k, smaller_right_count)
            if smaller_count <= k and k < smaller_right_count:
                return midval
            elif smaller_count > k:
                ends[pileidx] = mid
            else:
                begins[pileidx] = mid + 1
    return -1
Please find below C# code to find the k-th smallest element in the union of two sorted arrays. Time complexity: O(log k).
public int findKthElement(int k, int[] array1, int start1, int end1, int[] array2, int start2, int end2)
{
    // k is 0-based; end1/end2 are exclusive
    // if (k > m + n) exception
    if (start1 == end1)
    {
        return array2[start2 + k];
    }
    if (start2 == end2)
    {
        return array1[start1 + k];
    }
    if (k == 0)
    {
        return Math.Min(array1[start1], array2[start2]);
    }
    int mid = (k + 1) / 2;
    int sub1 = Math.Min(mid, end1 - start1);
    int sub2 = Math.Min(mid, end2 - start2);
    // the smaller of the two probes, together with everything before it,
    // cannot contain the kth element, so discard that prefix
    if (array1[start1 + sub1 - 1] < array2[start2 + sub2 - 1])
    {
        return findKthElement(k - sub1, array1, start1 + sub1, end1, array2, start2, end2);
    }
    else
    {
        return findKthElement(k - sub2, array1, start1, end1, array2, start2 + sub2, end2);
    }
}

Given 2 sorted arrays of integers, find the nth largest number in sublinear time [duplicate]

Possible Duplicate:
How to find the kth smallest element in the union of two sorted arrays?
This is a question one of my friends told me he was asked while interviewing, I've been thinking about a solution.
Sublinear time implies logarithmic to me, so perhaps some kind of divide and conquer method. For simplicity, let's say both arrays are the same size and that all elements are unique
I think this is two concurrent binary searches on the subarrays A[0..n-1] and B[0..n-1], which is O(log n).
Given sorted arrays, you know that the nth largest will appear somewhere before or at A[n-1] if it is in array A, or B[n-1] if it is in array B
Consider item at index a in A and item at index b in B.
Perform binary search as follows (pretty rough pseudocode, not taking into account off-by-one problems):
If a + b > n, then reduce the search set:
    if A[a] > B[b] then b = b / 2, else a = a / 2
If a + b < n, then increase the search set:
    if A[a] > B[b] then b = 3/2 * b, else a = 3/2 * a (halfway between a and the previous a)
If a + b = n, then the nth largest is max(A[a], B[b])
I believe worst case O(ln n), but in any case definitely sublinear.
I believe that you can solve this problem using a variant on binary search. The intuition behind this algorithm is as follows. Let the two arrays be A and B and let's assume for the sake of simplicity that they're the same size (this isn't necessary, as you'll see). For each array, we can construct parallel arrays Ac and Bc such that for each index i, Ac[i] is the number of elements in the two arrays that are no larger than A[i] and Bc[i] is the number of elements in the two arrays that are no larger than B[i]. If we could construct these arrays efficiently, then we could find the kth smallest element efficiently by doing binary searches on both Ac and Bc to find the value k. The corresponding entry of A or B for that entry is then the kth largest element. The binary search is valid because the two arrays Ac and Bc are sorted, which I think you can convince yourself of pretty easily.
Of course, this solution doesn't work in sublinear time because it takes O(n) time to construct the arrays Ac and Bc. The question then is: is there some way that we can implicitly construct these arrays? That is, can we determine the values in these arrays without necessarily constructing each element? I think that the answer is yes, using this algorithm. Let's begin by searching array A to see if it has the kth smallest value. We know for a fact that the kth smallest value can't appear in array A after position k (assuming all the elements are distinct). So let's focus just on the first k elements of array A. We'll do a binary search over these values as follows. Start at position k/2; this is the k/2th smallest element in array A. Now do a binary search in array B to find the largest value in B smaller than this value and look at its position in the array; this is the number of elements in B smaller than the current value. If we add up the positions of the elements in A and B, we have the total number of elements in the two arrays smaller than the current element. If this is exactly k, we're done. If this is less than k, then we recurse into the upper half of the first k elements of A, and if this is greater than k, we recurse into the lower half of the first k elements of A, etc. Eventually, we'll either find that the kth smallest element is in array A, in which case we're done, or not; in the latter case, repeat this process on array B.
The runtime for this algorithm is as follows. The search of array A does a binary search over k elements, which takes O(lg k) iterations. Each iteration costs O(lg n), since we have to do a binary search in B. This means that the total time for this search is O(lg k lg n). The time to do this in array B is the same, so the net runtime for the algorithm is O(lg k lg n) = O(lg^2 n) = o(n), which is sublinear.
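A hedged Python sketch of this double binary search (my code, assuming all elements are distinct, as the answer does; k is 1-based):
from bisect import bisect_left

def kth_smallest(A, B, k):
    def search(X, Y):
        # binary search over the first k elements of X for an element
        # whose combined rank across X and Y is exactly k
        lo, hi = 0, min(k, len(X)) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            rank = mid + 1 + bisect_left(Y, X[mid])  # elements <= X[mid]
            if rank == k:
                return X[mid]
            if rank < k:
                lo = mid + 1
            else:
                hi = mid - 1
        return None  # the kth element is not in X

    result = search(A, B)
    return result if result is not None else search(B, A)

print(kth_smallest([1, 3, 5], [2, 4, 6], 4))  # 4
Each probe costs one O(lg n) bisect, and the outer search makes O(lg k) probes, matching the O(lg k lg n) bound above.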
This is quite a similar answer to Kirk's.
Let Find(nth, A, B) be a function that returns the nth number, where |A| + |B| >= nth. This is simple pseudocode without checking whether one of the arrays is small (fewer than 3 elements); in the case of a small array, one or two binary searches in the larger array are enough to find the needed element.
Find(nth, A, B):
    If A.last() <= B.first():
        return B[nth - A.size()]
    If B.last() <= A.first():
        return A[nth - B.size()]
    Let a and b be the indexes of the middle elements of A and B
    Assume that A[a] <= B[b] (if not, swap the arrays)
    if nth <= a + b:
        return Find(nth, A, B.first_half(b))
    return Find(nth - a, A.second_half(a), B)
It is log(|A|) + log(|B|), and because the input arrays can be made to have n elements each, it is log(n) complexity.
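A Python rendering of this pseudocode (my sketch; the list slices copy, so this version does linear work per call rather than being truly logarithmic, and small arrays fall back to a plain merge as the answer suggests):
def find_nth(nth, A, B):
    # nth is 1-based; requires nth <= len(A) + len(B)
    if len(A) < 3 or len(B) < 3:
        return sorted(list(A) + list(B))[nth - 1]   # small-array fallback
    if A[-1] <= B[0]:
        return A[nth - 1] if nth <= len(A) else B[nth - len(A) - 1]
    if B[-1] <= A[0]:
        return B[nth - 1] if nth <= len(B) else A[nth - len(B) - 1]
    a, b = (len(A) + 1) // 2, (len(B) + 1) // 2     # 1-based middle positions
    if A[a - 1] > B[b - 1]:
        A, B, a, b = B, A, b, a                     # make A the array with the smaller middle
    if nth <= a + b:
        return find_nth(nth, A, B[:b])              # drop B's upper half
    return find_nth(nth - a, A[a:], B)              # drop A's lower half

print(find_nth(3, [2, 4, 6], [1, 3, 5]))  # 3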
int[] a = new int[] { 11, 9, 7, 5, 3 };
int[] b = new int[] { 12, 10, 8, 6, 4 };
int n = 7;
int result = 0;
if (n > (a.Length + b.Length))
    throw new Exception("n is greater than a.Length + b.Length");
else if (n < (a.Length + b.Length) / 2)
{
    // n is in the upper half: walk down from the largest element
    int ai = 0;
    int bi = 0;
    for (int i = n; i > 0; i--)
    {
        // find the highest from a or b
        if (ai < a.Length)
        {
            if (bi < b.Length)
            {
                if (a[ai] > b[bi])
                {
                    result = a[ai];
                    ai++;
                }
                else
                {
                    result = b[bi];
                    bi++;
                }
            }
            else
            {
                result = a[ai];
                ai++;
            }
        }
        else
        {
            if (bi < b.Length)
            {
                result = b[bi];
                bi++;
            }
            else
            {
                // error, n is greater than a.Length + b.Length
            }
        }
    }
}
else
{
    // go in reverse: walk up from the smallest element
    int ai = a.Length - 1;
    int bi = b.Length - 1;
    for (int i = a.Length + b.Length - n; i >= 0; i--)
    {
        // find the lowest from a or b
        if (ai >= 0)
        {
            if (bi >= 0)
            {
                if (a[ai] < b[bi])
                {
                    result = a[ai];
                    ai--;
                }
                else
                {
                    result = b[bi];
                    bi--;
                }
            }
            else
            {
                result = a[ai];
                ai--;
            }
        }
        else
        {
            if (bi >= 0)
            {
                result = b[bi];
                bi--;
            }
            else
            {
                // error, n is greater than a.Length + b.Length
            }
        }
    }
}
Console.WriteLine("{0} th highest = {1}", n, result);
Sublinear in what, though? You can't have an algorithm that doesn't check at least n elements; even verifying a solution would require checking that many. But the size of the problem here should surely mean the size of the arrays, so an algorithm that only checks n elements is sublinear.
So I think there's no trick here: start with the list with the smaller starting element and advance until you either:
Reach the nth element, and you're done.
Find that the next element is bigger than the next element in the other list, at which point you switch to the other list.
Run out of elements and switch.
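That switching walk is just an inlined two-way merge. A sketch in Python (my code, written for the nth smallest; the nth largest is symmetric, and n must not exceed len(A) + len(B)):
def nth_smallest(A, B, n):
    # advance through both sorted lists, always taking the smaller head
    i = j = 0
    val = None
    for _ in range(n):
        if j >= len(B) or (i < len(A) and A[i] <= B[j]):
            val = A[i]
            i += 1
        else:
            val = B[j]
            j += 1
    return val

print(nth_smallest([1, 3, 5], [2, 4, 6], 4))  # 4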
