Smallest sum of subarray with sum greater than a given value - arrays

Input: An array of N positive numbers and a value X, where N is small compared to X.
Output: A subarray whose numbers sum to some Y > X, such that there is no other subarray whose sum is greater than X but smaller than Y.
Is there a polynomial solution to this question? If so, can you present it?

As the other answers indicate, this is an NP-complete problem, closely related to the knapsack (subset-sum) problem, so no polynomial-time solution is known (unless P = NP). It does, however, have a pseudo-polynomial-time algorithm; explanations of what pseudo-polynomial time means, visual walkthroughs of the algorithm, and sample code are easy to find online.
If this is work-related (I have met this problem a few times already, in various disguises), I suggest introducing additional restrictions to simplify it. If it was a general question, you may want to read up on other NP-complete problems as well; there are well-known lists of them.
Edit 1:
AliVar made a good point. The given problem searches for the smallest Y > X, while the knapsack problem searches for the largest Y < X. So the answer for this problem needs a few more steps: finding the minimum sum Y > X is the same as finding the maximum complement sum S < (Total - X), and that second part is the original knapsack problem. So:
Find the total.
Solve knapsack for S < (Total - X).
Subtract the list of items in the knapsack solution from the original input.
This should give you the minimum Y > X.
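A minimal Python sketch of these steps, assuming positive integer inputs (a standard subset-sum DP stands in for the knapsack step; it returns only the best value, not the chosen items):

def min_sum_greater_than_via_complement(nums, x):
    # The complement of a subset with sum Y has sum total - Y, so the
    # smallest Y > x corresponds to the largest complement sum
    # s <= total - x - 1. Solve that with a standard subset-sum DP.
    total = sum(nums)
    cap = total - x - 1
    if cap < 0:
        return None  # even the whole array does not exceed x
    reachable = [True] + [False] * cap
    for a in nums:
        for s in range(cap, a - 1, -1):
            if reachable[s - a]:
                reachable[s] = True
    best_complement = max(s for s in range(cap + 1) if reachable[s])
    return total - best_complement

For example, min_sum_greater_than_via_complement([1, 2, 3], 3) returns 4, the smallest achievable sum above 3.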

Let A be our array. Here is an O(X*N) algorithm, in Python:
def smallest_sum_greater_than(A, X):
    S = {0}          # all subset sums <= X reachable so far
    parent = {}      # parent[t]: the sum t was extended from
    best_sum = float('inf')
    best_parent = -1
    for a in A:
        Sn = set()
        for s in S:
            t = s + a
            if X < t < best_sum:
                best_sum = t
                best_parent = s
            if t <= X and t not in S:
                Sn.add(t)
                parent[t] = s
        S |= Sn
    return best_sum, best_parent, parent
To recover the elements contributing to the best sum, walk the parent chain:
def recover_elements(best_sum, best_parent, parent):
    subarray = [best_sum - best_parent]
    t = best_parent
    while t in parent:
        subarray.append(t - parent[t])
        t = parent[t]
    return subarray
The idea is similar to dynamic programming. We calculate all reachable sums (those that can be obtained as a sum of chosen elements) that are at most X. For each element a in the array A, you either include it in the sum or you don't. At the update step S |= Sn, S represents all sums in which a does not participate, while Sn holds all sums in which a does participate.
You could represent S as a boolean array, setting an entry to true if that sum is in the set. Note that the length of this boolean array would be at most X + 1.
Overall, the algorithm is O(X*N) with memory usage O(X).
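A sketch of that boolean-array idea, with the set packed into a Python integer used as a bitmask (assumes X is a nonnegative integer; element recovery is dropped, so only the best sum is returned):

def smallest_sum_bitset(A, X):
    # Bit s of `reachable` is set iff subset sum s (0 <= s <= X) is reachable.
    reachable = 1                     # only the empty sum at the start
    mask = (1 << (X + 1)) - 1
    best = float('inf')
    for a in A:
        shifted = reachable << a      # every reachable sum, plus a
        over = shifted >> (X + 1)     # sums that just crossed X
        if over:
            lowest = (over & -over).bit_length() - 1
            best = min(best, X + 1 + lowest)
        reachable |= shifted & mask   # keep only sums <= X
    return best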

I think this problem is NP-hard and the subset sum can be reduced to it. Here is my reduction:
For an instance of subset sum with set S = {x1, ..., xn}, it is desired to find a subset with sum t. Suppose d is the minimum distance between two non-equal xi and xj. Build S' = {x1 + d/n, ..., xn + d/n} and feed it to your problem. Suppose your problem finds an answer, i.e. a subset D' of S' with sum Y > t that is the smallest sum with this property. Let D be the set of original members of D'. Three cases may happen:
1) Y = t + |D|*d/n, which means D is the solution to the original subset-sum problem.
2) Y > t + |D|*d/n, which means no answer set can be found for the original problem.
3) Y < t + |D|*d/n. In this case assign t = Y and repeat the problem. Since the value of t increases each time, this case does not repeat exponentially many times, so the procedure terminates in polynomial time.

Related

Maximize number of inversion count in array

We are given an unsorted array A of integers (duplicates allowed) of size N, possibly large. We can count the number of pairs with indices i < j for which A[i] < A[j]; let's call this X.
We can change at most one element of the array, with a cost equal to the absolute difference (for instance, if we replace the element at index k with a new number K, the cost Y is |A[k] - K|).
We can only replace an element with another value already present in the array.
We want to find the minimum possible value of X + Y.
Some examples:
[1,2,2] should return 1 (change the 1 to 2 such that the array becomes [2,2,2])
[2,2,3] should return 1 (change the 3 to 2)
[2,1,1] should return 0 (because no changes are necessary)
[1,2,3,4] should return 6 (this is already the minimum possible value)
[4,4,5,5] should return 3 (this can be accomplished by changing the first 4 into a 5, or the last 5 into a 4)
The number of pairs can be found with a naive O(n²) solution, here in Python:
def calc_x(arr):
    n = len(arr)
    cnt = 0
    for i in range(n):
        for j in range(i + 1, n):
            if arr[j] > arr[i]:
                cnt += 1
    return cnt
A brute-force solution is easily written, for example:
def f(arr):
    best_val = calc_x(arr)
    used = set(arr)
    for i, v in enumerate(arr):
        for replacement in used:
            if replacement == v:
                continue
            arr2 = arr[:i] + [replacement] + arr[i+1:]
            y = abs(replacement - v)
            x = calc_x(arr2)
            best_val = min(best_val, x + y)
    return best_val
We can count, for each element, the number of items to the right of it that are larger than itself in O(n log n), using for instance an AVL tree or a variation on merge sort.
However, we still have to search which element to change and what improvement it can achieve.
This was given as an interview question and I would like some hints or insights as how to solve this problem efficiently (data structures or algorithm).
Definitely go for an O(n log n) approach when counting inversions.
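For reference, here is one way to count X in O(n log n) with a merge-sort pass (a sketch; note it counts ascending pairs A[i] < A[j] with i < j, which is what this question calls inversions):

def count_x(arr):
    # Returns the number of pairs i < j with arr[i] < arr[j],
    # accumulated during a recursive merge sort.
    def sort(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, cl = sort(a[:mid])
        right, cr = sort(a[mid:])
        merged, cnt = [], cl + cr
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] < right[j]:
                # left[i] is smaller than right[j] and everything after it,
                # and all of those sit at larger original indices
                cnt += len(right) - j
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged += left[i:] + right[j:]
        return merged, cnt
    return sort(list(arr))[1]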
We can see that when you change a value at index k, you can either:
1) increase it, and then possibly reduce the number of inversions with elements bigger than the old value, but increase the number of inversions with elements smaller than it
2) decrease it (and the opposite happens)
Let's try not to count x every time you change a value. What do you need to know?
In case 1):
You have to know how many elements on the left are smaller than your new value v and how many elements on the right are bigger than it. You can check that pretty easily in O(n). So what is your x now? You can count it with the following formula:
prev_val - your previous value
prev_x - the x you counted at the beginning of the program
prev_l - number of elements on the left smaller than prev_val
prev_r - number of elements on the right bigger than prev_val
v - new value
l - number of elements on the left smaller than v
r - number of elements on the right bigger than v
new_x = prev_x + r + l - prev_l - prev_r
In the second case you pretty much do the opposite thing.
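A sketch of that O(n) recomputation for a single candidate change, using the counts defined above (the same formula covers both the increase and the decrease cases):

def new_x_after_change(arr, prev_x, k, v):
    # prev_x is the pair count for arr; returns the pair count after
    # replacing arr[k] with v, in O(n).
    prev_val = arr[k]
    prev_l = sum(1 for e in arr[:k] if e < prev_val)
    prev_r = sum(1 for e in arr[k+1:] if e > prev_val)
    l = sum(1 for e in arr[:k] if e < v)
    r = sum(1 for e in arr[k+1:] if e > v)
    return prev_x + r + l - prev_l - prev_r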
Right now you get something like O(n^3) instead of O(n^3 log n), which is probably still bad. Unfortunately that's all I came up with for now. I'll definitely tell you if I come up with something better.
EDIT: What about the memory limit? Is there any? If not, you can simply build, for each element in the array, two sets holding the elements before and after it. Then you can find the number of smaller/bigger elements in O(log n), making the time complexity O(n^2 log n).
EDIT 2: We can also check which element would be the best to change to a value v, for every possible value v. You can maintain two sets, adding/erasing elements from them while checking every element, giving O(n^2 log n) time without using too much space. The algorithm would be:
1) determine every value v to which an element could be changed, and calculate x
2) for each possible value v:
   make two sets, and push all elements into the second one
   for each element e in the array:
     move the preceding element (if there is one) into the first set and erase e from the second set, then count the number of bigger/smaller elements in sets 1 and 2 and calculate the new x
EDIT 3: Instead of keeping two sets, you could use a prefix sum per value. That's O(n^2) already, but I think we can do even better.

Possibly simpler O(n) solution to find the Sub-array of length K (or more) with the maximum average

I saw this question on a coding competition site.
Suppose you are given an array of n integers and an integer k (n <= 10^5, 1 <= k <= n). How do you find the (contiguous) sub-array with maximum average whose length is at least k?
There's an O(n) solution presented in research papers (arxiv.org/abs/cs/0207026), linked in a duplicate SO question. I'm posting this as a separate question since I think I have a similar method with a simpler explanation. Do you think there's anything wrong with my logic in the solution below?
Here's the logic:
Start with the range of window as [i,j] = [0,K-1]. Then iterate over remaining elements.
For every next element j, update the prefix sum**. Now we have a choice: use the full range [i, j], or discard the range [i : j-k] and keep [j-k+1 : j] (i.e. keep the latest K elements). Choose the range with the higher average (the prefix sums make this O(1)).
Keep track of the max average at every step
Return the max avg at the end
** I calculate the prefix sum as I iterate over the array. The prefix sum at i is the cumulative sum of the first i elements in the array.
Code:
def findMaxAverage(nums, k):
    prefix = [0]
    for i in range(k):
        prefix.append(float(prefix[-1] + nums[i]))
    mavg = prefix[-1] / k
    lbound = -1
    for i in range(k, len(nums)):
        prefix.append(prefix[-1] + nums[i])
        cavg = (prefix[i+1] - prefix[lbound+1]) / (i - lbound)
        altavg = (prefix[i+1] - prefix[i-k+1]) / k
        if altavg > cavg:
            lbound = i - k
            cavg = altavg
        mavg = max(mavg, cavg)
    return mavg
Consider k = 3 and the sequence
3, 0, 0, 2, 0, 1, 3
The output of your program is 1.3333333333333333: it found the subarray 0, 1, 3, but the best possible subarray is 2, 0, 1, 3, with average 1.5.
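To check counterexamples like this one, a brute-force O(n^2) reference implementation helps (a sketch, separate from the solutions above):

def brute_force_max_average(nums, k):
    # Try every subarray of length >= k; quadratic but trustworthy.
    prefix = [0]
    for v in nums:
        prefix.append(prefix[-1] + v)
    best = float('-inf')
    for i in range(len(nums)):
        for j in range(i + k, len(nums) + 1):
            best = max(best, (prefix[j] - prefix[i]) / (j - i))
    return best

brute_force_max_average([3, 0, 0, 2, 0, 1, 3], 3) returns 1.5, confirming the counterexample.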

Two arrays and number-- best algorithm

This is a question I got in a job interview:
You are given two sorted arrays (sizes n and m) and a number x. What would be the best algorithm to find the indexes of two numbers (one from each array) such that their sum equals the given number?
I couldn't find a better answer than the naive solution, which is:
Start from the smaller array, at the cell containing the largest number smaller than x.
For each cell in the small array, do a binary search on the big one, looking for the number that makes the sum equal x.
Continue until the first cell of the smaller array, returning the appropriate indexes.
Return FALSE if no such numbers exist.
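For concreteness, a sketch of this naive approach in Python (bisect-based; assumes both arrays are sorted ascending, and simply scans the whole smaller array rather than starting from the largest element below x):

import bisect

def find_pair_binary_search(small, big, x):
    # For each value in the smaller array, binary-search the bigger
    # array for the exact complement x - v.
    for i, v in enumerate(small):
        j = bisect.bisect_left(big, x - v)
        if j < len(big) and big[j] == x - v:
            return i, j
    return False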
Can anyone think of a better solution in terms of runtime?
Use two indices i1, i2, starting at i1 = 0 and i2 = n - 1:
def find_pair(arr1, arr2, target):
    i1, i2 = 0, len(arr2) - 1
    while i1 < len(arr1) and i2 >= 0:
        s = arr1[i1] + arr2[i2]
        if s == target:
            return i1, i2
        elif s > target:
            i2 -= 1
        else:
            i1 += 1
    return None  # no pair found
The idea is to use the fact that the array is sorted, so start from the two edges of the arrays, and at each iteration, make changes so you will get closer to the desired element
Complexity is O(n+m) under worst-case analysis, which beats the binary-search approach (O(min{m,n} * log(max{m,n}))) whenever that product is at least m + n.
Proof of correctness (guidelines):
Assume the answer is yes, with indices k1, k2.
Then for every i2 > k2, arr1[k1] + arr2[i2] > SUM, so i1 will NOT advance past k1 before i2 descends to k2. Similarly, once i2 == k2, i2 will NOT change before i1 reaches k1.
Since we linearly scan the arrays, one of i1, i2 reaches k1 or k2 at some point, and the iteration then moves the other index to the correct location and finds the answer.
QED
Notes:
If you want to output ALL pairs that match the desired sum: when arr1[i1] + arr2[i2] == SUM, advance the index whose next element (in its iteration direction) has the LOWER absolute difference from the current one. This makes sure you output all desired pairs.
Note that this solution might fail for duplicate elements. As is, it works as long as there is no pair (x, y) such that x AND y both have duplicates.
To handle that case, you need to 'go back up' once you have exhausted all possible pairs in one direction, and the pseudo-code should be updated to:
dupeJump = -1
while i1 < m && i2 >= 0:
    if arr1[i1] + arr2[i2] == SUM:
        yield i1, i2
        if arr1[i1+1] == arr1[i1] AND arr2[i2-1] == arr2[i2]:
            // remember where we were, in case of double dupes
            if dupeJump == -1:
                dupeJump = i2
            i2--
        else:
            if abs(arr1[i1+1] - arr1[i1]) < abs(arr2[i2-1] - arr2[i2]):
                i1++
            else:
                i2--
            // go back up, because there are more pairs to print due to dupes
            if dupeJump != -1:
                i2 = dupeJump
                dupeJump = -1
    else if arr1[i1] + arr2[i2] > SUM:
        i2--
    else:
        i1++
Note however that the time complexity might then increase to O(n + m + size(output)), because there can be O(n*m) such pairs and you need to output all of them (note that every correct solution has this restriction).

Find all pairs (x, y) in a sorted array so that x + y < z

This is an interview question. Given a sorted integer array and a number z, find all pairs (x, y) in the array such that x + y < z. Can it be done better than O(n^2)?
P.S. I know that we can find all pairs (x, y | x + y == z) in O(N).
You cannot necessarily find all such pairs in O(n) time, because there might be O(n^2) pairs of values with this property. In general, an algorithm can't take less time to run than the number of values it produces.
Hope this helps!
In general, no, it can't. Consider the case where x + y < z for all x, y in the array. You have to touch (e.g. display) all of the n(n - 1)/2 possible pairs in the set. This is fundamentally O(n^2).
If you are asked to output all pairs that satisfy that property, I don't think there is anything better than O(N^2), since there can be O(N^2) pairs in the output.
But this is also true for x + y = z, for which you claim there is an O(N) solution - so I might be missing something.
I suspect the original question asked for the number of pairs. In that case it can be done in O(N log N): for each element x, compute y = z - x and do a binary search for y in the array. The position of y gives the number of pairs that can be formed with that particular value of x. Summing this over all values in the array gives the answer. There are N values and each binary search takes O(log N), so the whole thing is O(N log N).
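A sketch of that counting approach in Python (bisect-based; each pair is counted once, with the lower index treated as x):

import bisect

def count_pairs_below(arr, z):
    # arr is sorted ascending; counts pairs i < j with arr[i] + arr[j] < z.
    count = 0
    for i, x in enumerate(arr):
        # first index whose value is >= z - x; everything before it pairs with x
        j = bisect.bisect_left(arr, z - x)
        count += max(0, j - i - 1)
    return count

For example, count_pairs_below([1, 2, 3, 4], 5) returns 2, for the pairs (1, 2) and (1, 3).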
You can find them in O(N), if you add the additional constraint that each element is unique.
After finding all of the x + y == z pairs, you know that for every such pair, x combined with any element at a lower index than y also gives a sum below z (and symmetrically with the roles of x and y swapped, choosing one convention).
Actually selecting these pairs and outputting them would take O(n^2), but in a sense the x + y == z pairs, together with the input, are a compressed form of the answer.
(You can preprocess the input into a form where each element is unique, together with a counter for the number of occurrences. This takes O(N) time. You can generalize this solution to unsorted arrays, increasing the time to O(n log n).)
The justification for claiming the pairs can be found in less time than the size of the explicit output: consider the question "what are the integers between 0 and a given input K?". The answer has a compact description even though writing it out takes K steps.
Because it is a sorted integer array, you could use binary search; the best case is O(N), and the worst and average cases are O(N*log N).
You can sort the array, and for every element smaller than z, use binary search - total O(N log N).
Total run time: O(|P| + N log N), where P is the set of resulting pairs.
There actually exists an O(n log n) solution to this question.
What I would do (after first checking whether I'm allowed to) is define the output format of my algorithm/function.
I would define it as a sequence of elements (S, T): S is the position of an element in the array (or its value), and T is the position bounding the sub-array [0, T]. For example, T = 3 means that element S combined with each of elements 0, 1, 2 and 3 satisfies the desired condition.
The total result is O(n log n) run time and O(n) memory.

Why is the average number of steps for finding an item in an array N/2?

Could somebody explain why the average number of steps for finding an item in an unsorted array data-structure is N/2?
This really depends what you know about the numbers in the array. If they're all drawn from a distribution where all the probability mass is on a single value, then on expectation it will take you exactly 1 step to find the value you're looking for, since every value is the same, for example.
Let's now make a pretty strong assumption, that the array is filled with a random permutation of distinct values. You can think of this as picking some arbitrary sorted list of distinct elements and then randomly permuting it. In this case, suppose you're searching for some element in the array that actually exists (this proof breaks down if the element is not present). Then the number of steps you need to take is given by X, where X is the position of the element in the array. The average number of steps is then E[X], which is given by
E[X] = 1 Pr[X = 1] + 2 Pr[X = 2] + ... + n Pr[X = n]
Since we're assuming all the elements are drawn from a random permutation,
Pr[X = 1] = Pr[X = 2] = ... = Pr[X = n] = 1/n
So this expression is given by
E[X] = sum(i = 1 to n) i/n = (1/n) * sum(i = 1 to n) i = (1/n) * n(n + 1)/2 = (n + 1)/2
Which, I think, is the answer you're looking for.
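A quick empirical check of that (n + 1)/2 result (a sketch: one "step" per cell inspected, target drawn uniformly from a random permutation):

import random

def average_search_steps(n, trials=100000):
    total = 0
    for _ in range(trials):
        arr = list(range(n))
        random.shuffle(arr)
        target = random.randrange(n)    # uniformly chosen present element
        total += arr.index(target) + 1  # steps = 1-based position
    return total / trials

For n = 10 this hovers around 5.5, matching (n + 1)/2.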
The question as stated is just wrong. Linear search may perform better.
Perhaps a simpler example that shows why the average is about N/2 (exactly (N + 1)/2 here) is this:
Assume you have an unsorted array of 10 items: [5, 0, 9, 8, 1, 2, 7, 3, 4, 6]. This is all the digits [0..9].
Since the array is unsorted (i.e. you know nothing about the order of the items), the only way you can find a particular item in the array is by doing a linear search: start at the first item and go until you find what you're looking for, or you reach the end.
So let's count how many operations it takes to find each item. Finding the first item (5) takes only one operation. Finding the second item (0) takes two. Finding the last item (6) takes 10 operations. The total number of operations required to find all 10 items is 1+2+3+4+5+6+7+8+9+10, or 55. The average is 55/10, or 5.5.
The "linear search takes, on average, N/2 steps" conventional wisdom makes a number of assumptions. The two biggest are:
The item you're looking for is in the array. If an item isn't in the array, then it takes N steps to determine that. So if you're often looking for items that aren't there, then your average number of steps per search is going to be much higher than N/2.
On average, each item is searched for approximately as often as any other item. That is, you search for "6" as often as you search for "0", etc. If some items are looked up significantly more often than others, then the average number of steps per search is going to be skewed in favor of the items that are searched for more frequently. The number will be higher or lower than N/2, depending on the positions of the most frequently looked-up items.
While I think templatetypedef has the most instructive answer, in this case there is a much simpler one.
Consider permutations of the set {x1, x2, ..., xn} where n = 2m. Now take some element xi you wish to locate. For each permutation where xi occurs at index m - k, there is a corresponding mirror-image permutation where xi occurs at index m + k. The mean of each such pair of indices is [(m - k) + (m + k)]/2 = m = n/2. Therefore the mean over all possible permutations of the set is n/2.
Consider a simple reformulation of the question:
What would be the limit of
lim (i -> inf) of (sum from j = 1 to i of random(n)) / i
Or in C:
int sum = 0, i;
for (i = 0; i < LARGE_NUM; i++) sum += random(n);
sum /= LARGE_NUM;
If we assume that random has a uniform distribution of values (each value from 1 to n is equally likely to be produced), then the expected result is (1 + n)/2.
