Most Efficient Algorithm to Align Multiple Ordered Sequences - arrays

I have a strange feeling this is a very easy problem to solve but I'm not finding a good way of doing this without using brute force or dynamic programming. Here it goes:
Given N arrays of ordered, monotonic values, find one position per array, i1, i2 ... iN, such that the pair-wise difference of the values at those indexes across all arrays is minimised. In other words, find the positions in all arrays whose values are closest to each other. Multiple solutions may exist, and the arrays may or may not be equally sized.
If A denotes the list of all arrays, the pair-wise difference is given by the sum of absolute differences between the values at the chosen indexes, taken over all pairs of distinct arrays:

cost(i1, ..., iN) = sum over all pairs j < k of |A_j[i_j] - A_k[i_k]|
An example, 3 arrays a, b and c:
a = [20 29 30 32 33]
b = [28 29 30 32 33]
c = [10 12 28 31 32 33]
The best alignment for these arrays would be a[3] b[3] c[4] or a[4] b[4] c[5], because (32,32,32) and (33,33,33) are all equal values and therefore have minimum pairwise difference between each other. (Assuming array indexes start at 0.)
This is a common problem in bioinformatics that's usually solved with Dynamic Programming, but since these are ordered sequences, I think there should be some way of exploiting that order. I first thought about doing this pairwise, but that does not guarantee the global optimum, because the best local answer might not be the best global answer.
This is meant to be language agnostic, but I don't really mind an answer for a specific language, as long as there is no loss of generality. I know Dynamic Programming is an option here, but I have a feeling there's an easier way to do this?

The tricky thing is parsing the arrays so that at some point you're guaranteed to be considering the set of indices that realize the pairwise min. Using a min heap on the values doesn't work. Counterexample with 4 arrays: [0,5], [1,2], [2], [2]. We start with a d(0,1,2,2) = 7, optimal is d(0,2,2,2) = 6, but the min heap moves us from 7 to d(5,1,2,2) = 12, then d(5,2,2,2) = 9.
I believe (but haven't proved) that if we always increment the index that improves the pairwise distance the most (or degrades it the least), we're guaranteed to visit every local min and the global min.
Assuming n total elements across k arrays:
Simple approach: we repeatedly get the pairwise distance deltas (the delta w.r.t. incrementing each index), increment the best one, and any time doing so switches us from improvement to degradation (i.e. a local minimum) we calculate the pairwise distance. All of this is O(k^2) per increment, for a total running time of O((n-k) * k^2).
With O(k^2) storage, we could keep a table where entry (i,j) stores the pairwise distance delta achieved by incrementing the index of array i w.r.t. array j, along with the per-array delta sums. Then on incrementing an index we only need to update the appropriate row, column, and sums, all in O(k). This gives us a running time of O((n-k)*k).
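A minimal sketch of that bookkeeping (my own illustration of the idea, not code from this answer; for brevity it assumes no index has reached the end of its array yet — the sentinel trick described in the next answer handles exhausted arrays). The matrix is oriented so that the full delta for incrementing array i is the sum of row i:

def make_delta_table(arrays, idx):
    # D[i][j] is the change in |arrays[i][idx[i]] - arrays[j][idx[j]]|
    # caused by incrementing idx[i]; the full delta for incrementing
    # array i is the sum of row i.
    k = len(arrays)

    def pair_delta(i, j):
        old = abs(arrays[i][idx[i]] - arrays[j][idx[j]])
        new = abs(arrays[i][idx[i] + 1] - arrays[j][idx[j]])
        return new - old

    D = [[pair_delta(i, j) if i != j else 0 for j in range(k)]
         for i in range(k)]
    row_sum = [sum(row) for row in D]
    return D, row_sum

def apply_increment(arrays, idx, D, row_sum, m):
    # Incrementing idx[m] only invalidates row m and column m,
    # so the table and its sums can be patched in O(k) instead of
    # rebuilding the whole O(k^2) table.
    idx[m] += 1
    k = len(arrays)
    for j in range(k):
        if j == m:
            continue
        # Array m's current value changed, so pair (j, m) has a new delta...
        new_jm = (abs(arrays[j][idx[j] + 1] - arrays[m][idx[m]])
                  - abs(arrays[j][idx[j]] - arrays[m][idx[m]]))
        row_sum[j] += new_jm - D[j][m]
        D[j][m] = new_jm
        # ...and m's own next-step delta against j changed too.
        D[m][j] = (abs(arrays[m][idx[m] + 1] - arrays[j][idx[j]])
                   - abs(arrays[m][idx[m]] - arrays[j][idx[j]]))
    row_sum[m] = sum(D[m][j] for j in range(k) if j != m)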

To complete Dave's answer, here is the pseudocode of the delta algorithm:
initialise index_table to 0's, where each entry i denotes the current index into the ith array
initialise delta_table with the corresponding cost of incrementing the index of the
    ith array while keeping the other indexes at their current values
cur_cost <- cost of current index_table
best_solutions <- list containing the current index_table

while (can_at_least_one_index_increase)
    i <- index whose delta is lowest
    increment i-th entry of the index_table
    if cost(index_table) < cur_cost
        cur_cost <- cost(index_table)
        best_solutions <- {index_table}
    else if cost(index_table) = cur_cost
        best_solutions <- best_solutions U {index_table}
    update delta_table
Important note: during an iteration, some index_table entries might already have reached the maximum value for their array. When updating the delta_table, those entries must never be picked, otherwise the result is an array-out-of-bounds access, a segmentation fault, or undefined behaviour. A neat trick is simply to check which indexes are already at their maximum and assign them a sufficiently large delta, so they are never picked. If no index can increase anymore, the loop ends.
Here's an implementation in Python:
def align_ordered_sequences(arrays: list):
    def get_cost(index_table):
        # Sum of absolute differences over all pairs of arrays,
        # at the current indexes
        n = len(arrays)
        if n == 1:
            return 0
        total = 0
        for i in range(0, n - 1):
            for j in range(i + 1, n):
                v1 = arrays[i][index_table[i]]
                v2 = arrays[j][index_table[j]]
                total += abs(v1 - v2)
        return total

    def compute_delta_table(index_table):
        # Initialise the delta table: we bump each index by 1, call
        # the cost method and then revert the change; this avoids having
        # to create copies, which would hurt performance unnecessarily.
        # Note: each entry stores the cost *after* the increment, not the
        # difference, but the argmin is the same either way.
        delta_table = []
        for i in range(n):
            if index_table[i] + 1 >= len(arrays[i]):
                # Implementation detail: if the index is outside the bounds
                # of array i, choose a "large enough" number so that this
                # entry is never picked
                delta_table.append(999999999999999)
            else:
                index_table[i] = index_table[i] + 1
                delta_table.append(get_cost(index_table))
                index_table[i] = index_table[i] - 1
        return delta_table

    def can_at_least_one_index_increase(index_table):
        for i in range(len(arrays)):
            if index_table[i] < len(arrays[i]) - 1:
                return True
        return False

    n = len(arrays)
    index_table = [0] * n
    delta_table = compute_delta_table(index_table)
    cur_cost = get_cost(index_table)
    best_solutions = [index_table.copy()]
    while can_at_least_one_index_increase(index_table):
        i = delta_table.index(min(delta_table))
        index_table[i] = index_table[i] + 1
        new_cost = get_cost(index_table)
        if new_cost < cur_cost:
            # A new best solution was found
            cur_cost = new_cost
            best_solutions = [index_table.copy()]
        elif new_cost == cur_cost:
            # A new solution with the same cost was found
            best_solutions.append(index_table.copy())
        # Update the delta table
        delta_table = compute_delta_table(index_table)
    return best_solutions
And here are some examples:
>>> print(align_ordered_sequences([[0,5], [1,2], [2], [2]]))
[[0, 1, 0, 0]]
>>> print(align_ordered_sequences([[3, 5, 8, 29, 40, 50], [1, 4, 14, 17, 29, 50]]))
[[3, 4], [5, 5]]
Note 2: this outputs indexes, not the actual values of each array.

Related

Find Minimum Score Possible

Problem statement:
We are given three arrays A1, A2, A3 of lengths n1, n2, n3. Each array contains some (or no) natural numbers (i.e. > 0). These numbers denote program execution times.
At each step, the task is to choose the first element of any array, execute that program, and remove it from its array.
For example:
if A1=[3,2] (n1=2),
A2=[7] (n2=1),
A3=[1] (n3=1)
then we can execute programs in various orders like [1,7,3,2] or [7,1,3,2] or [3,7,1,2] or [3,1,7,2] or [3,2,1,7] etc.
Now if we take S=[1,3,2,7] as the order of execution, the waiting times of the various programs would be
for S[0] waiting time = 0, since executed immediately,
for S[1] waiting time = 0+1 = 1, taking previous time into account, similarly,
for S[2] waiting time = 0+1+3 = 4
for S[3] waiting time = 0+1+3+2 = 6
Now the score of an order is defined as the sum of all wait times = 0 + 1 + 4 + 6 = 11. This is the minimum score we can get from any order of execution.
Our task is to find this minimum score.
How can we solve this problem? I tried an approach of picking the minimum of the three front elements each time, but it is not correct, because it gets stuck when two or three equal elements are encountered.
One more example:
if A1=[23,10,18,43], A2=[7], A3=[13,42] minimum score would be 307.
The simplest way to solve this is with dynamic programming (which runs in cubic time).
For each array A: suppose you take the first element of A, i.e. A[0], as the next process. Your total cost is the wait-time contribution of A[0] (i.e., A[0] * (total_remaining_elements - 1)), plus the minimal wait-time sum over A[1:] and the rest of the arrays.
Take the minimum cost over each possible first array A, and you'll get the minimum score.
Here's a Python implementation of that idea. It works with any number of arrays, not just three.
import functools
from typing import List, Tuple

def dp_solve(arrays: List[List[int]]) -> int:
    """Given a list of arrays representing dependent processing times,
    return the smallest sum of wait_time_before_start over all job orders"""
    arrays = [x for x in arrays if len(x) > 0]  # Remove empty arrays

    @functools.lru_cache(100000)
    def dp(remaining_elements: Tuple[int, ...],
           total_remaining: int) -> int:
        """Returns minimum wait time sum when suffixes of each array
        have lengths in 'remaining_elements'"""
        if total_remaining == 0:
            return 0
        rem_elements_copy = list(remaining_elements)
        best = 10 ** 20
        for i, x in enumerate(remaining_elements):
            if x == 0:
                continue
            # Executing arrays[i][-x] now makes the other
            # (total_remaining - 1) jobs each wait that long
            cost_here = arrays[i][-x] * (total_remaining - 1)
            if cost_here >= best:
                continue
            rem_elements_copy[i] -= 1
            best = min(best,
                       dp(tuple(rem_elements_copy), total_remaining - 1)
                       + cost_here)
            rem_elements_copy[i] += 1
        return best

    return dp(tuple(map(len, arrays)), sum(map(len, arrays)))
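A quick check against the two examples from the question (expected outputs are the ones stated there):

print(dp_solve([[3, 2], [7], [1]]))                # 11
print(dp_solve([[23, 10, 18, 43], [7], [13, 42]])) # 307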
Better solutions
The naive greedy strategy of 'smallest first element' doesn't work, because it can be worth it to do a longer job first in order to reach a much shorter job in the same list, as the example of
A1 = [100, 1, 2, 3], A2 = [38], A3 = [34],
best solution = [100, 1, 2, 3, 34, 38]
by user3386109 in the comments demonstrates: the naive greedy order [34, 38, 100, 1, 2, 3] scores 626, while the order above scores 550.
A more refined greedy strategy does work. Instead of the smallest first element, consider each possible prefix of each array. We pick the array with the smallest prefix, where prefixes are compared by average process time, and perform all the processes in that prefix in order.
A1 = [ 100, 1, 2, 3]
Prefix averages = [(100)/1, (100+1)/2, (100+1+2)/3, (100+1+2+3)/4]
= [ 100.0, 50.5, 34.333, 26.5]
A2=[38]
A3=[34]
Smallest prefix average in any array is 26.5, so pick
the prefix [100, 1, 2, 3] to complete first.
Then [34] is the next prefix, and [38] is the final prefix.
And here's a rough Python implementation of the greedy algorithm. This code computes subarray averages in a completely naive/brute-force way, so the algorithm is still quadratic (but an improvement over the dynamic programming method). Also, it computes 'maximum suffixes' instead of 'minimum prefixes' for ease of coding, but the two strategies are equivalent.
import math
from typing import List

def greedy_solve(arrays: List[List[int]]) -> int:
    """Given a list of arrays representing dependent processing times,
    return the smallest sum of wait_time_before_start over all job orders"""
    def max_suffix_avg(arr: List[int]):
        """Given arr, return value and length of max-average suffix"""
        if len(arr) == 0:
            return (-math.inf, 0)
        best_len = 1
        best = -math.inf
        curr_sum = 0.0
        for i, x in enumerate(reversed(arr), 1):
            curr_sum += x
            new_avg = curr_sum / i
            if new_avg >= best:
                best = new_avg
                best_len = i
        return (best, best_len)

    arrays = [x for x in arrays if len(x) > 0]  # Remove empty arrays
    if not arrays:
        return 0
    total_time_sum = sum(sum(x) for x in arrays)
    my_averages = [max_suffix_avg(arr) for arr in arrays]
    total_cost = 0
    while True:
        largest_avg_idx = max(range(len(arrays)),
                              key=lambda y: my_averages[y][0])
        _, n_to_remove = my_averages[largest_avg_idx]
        if n_to_remove == 0:
            break
        for _ in range(n_to_remove):
            total_time_sum -= arrays[largest_avg_idx].pop()
            total_cost += total_time_sum
        # Recompute the changed array's suffix average
        my_averages[largest_avg_idx] = max_suffix_avg(arrays[largest_avg_idx])
    return total_cost
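And the same check for the greedy version, which agrees with the DP answer on the question's second example:

print(greedy_solve([[23, 10, 18, 43], [7], [13, 42]])) # 307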

Finding max sum with operation limit

As an input I'm given an array of integers (all positive).
Also as an input I'm given a number of "actions". The goal is to find the max possible sum of array elements for the given number of actions.
As an "action" I can either:
Add the current element to the sum
Move to the next element
We start at position 0 in the array. Each element can be added only once.
Limitations are:
2 < array.Length < 20
0 < number of "actions" < 20
It seems to me that these limitations are essentially not important. It's possible to try each combination of "actions", but in that case the complexity would be like 2^"actions" and this is bad...))
Examples:
array = [1, 4, 2], 3 actions. Output should be 5. In this case we added the zeroth element, moved to the first element, and added the first element.
array = [7, 8, 9], 2 actions. Output should be 8. In this case we moved to the first element, then added the first element.
Could anyone please explain to me an algorithm to solve this problem? Or at least the direction in which I should try to solve it.
Thanks in advance
Here is another DP solution using memoization. The idea is to represent the state by a pair of integers (current_index, actions_left) and map it to the maximum sum when starting from the current_index, assuming actions_left is the upper bound on actions we are allowed to take:
from functools import lru_cache

def best_sum(arr, num_actions):
    """Get best sum from arr given a budget of actions limited to num_actions"""
    @lru_cache(None)
    def dp(idx, num_actions_):
        """Return best sum starting at idx (inclusive)
        with number of actions = num_actions_ available"""
        # Return zero if out of list elements or actions
        if idx >= len(arr) or num_actions_ <= 0:
            return 0
        # Otherwise, decide if we should include the current element or not
        return max(
            # If we include the element at idx
            # we spend two actions: one to include the element and one to
            # move to the next element
            dp(idx + 1, num_actions_ - 2) + arr[idx],
            # If we do not include the element at idx
            # we spend one action to move to the next element
            dp(idx + 1, num_actions_ - 1)
        )

    return dp(0, num_actions)
I am using Python 3.7.12.
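A quick check against the examples from the question:

print(best_sum([1, 4, 2], 3))  # 5
print(best_sum([7, 8, 9], 2))  # 8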
array = [1, 1, 1, 1, 100]
actions = 5
In an example like the above, you just have to keep moving right and finally pick up the 100. At the beginning of the array we never know what values we are going to see further on. So, this can't be greedy.
There are two kinds of action, and you have to try out both, because you don't know in advance which one to apply when.
Below is Python code. If you are not familiar with it, treat it as pseudocode, or feel free to convert it to the language of your choice. We recursively try both actions until we run out of actions or we reach the end of the input array.
def getMaxSum(current_index, actions_left, current_sum):
    global max_sum
    if actions_left == 0 or current_index == len(array):
        max_sum = max(max_sum, current_sum)
        return
    if actions_left == 1:
        # Add the current element to the sum (last remaining action)
        getMaxSum(current_index, actions_left - 1, current_sum + array[current_index])
    else:
        # Add the current element to the sum and move to the next element
        getMaxSum(current_index + 1, actions_left - 2, current_sum + array[current_index])
        # Move to the next element without adding
        getMaxSum(current_index + 1, actions_left - 1, current_sum)

array = [7, 8, 9]
actions = 2
max_sum = 0
getMaxSum(0, actions, 0)
print(max_sum)  # 8
You will realize that there can be overlapping sub-problems here, and we can avoid those repetitive computations by memoizing/caching the results of the sub-problems. I leave that task to you as an exercise. Basically, this is a Dynamic Programming problem.
Hope it helped. Post in the comments if you have any doubts.

What's the most efficient way to manipulate Arrays in Ruby

I'm trying to make the following code more efficient but I've run out of ideas, so I'm looking for some help:
I receive an interval x and an array of integers space, and I need to separate the space array into segments of interval x, so for example if
space = [1, 2, 3, 1, 2]
x = 2
# The intervals would be
[1, 2][2, 3][3, 1][1, 2]
So then I need to find the minimum of each interval and then the max value of those minimums, so in the example above the minimums would be:
[1, 2, 1, 1]
Then I need to return the max value of the minimums, which would be 2.
Below is my solution. Most test cases pass, but some of them are failing because the execution time limit is exceeded. Any ideas on how to make the following code more efficient?
def segment(x, space)
  minimums = {}
  space.each_with_index do |_val, index|
    # 1. Create the segment
    mark = x - 1
    break if mark >= space.length
    segment = space[index..mark]
    x = x + 1
    # 2. Find the min of the segment
    minimum = segment.min
    minimums[index] = minimum
  end
  # Return the maximum of the minimums
  return minimums.values.max
end
I tried making minimums a hash instead of an array, but that didn't work. I also thought about converting the space array to a hash and manipulating the hash instead of an array, but I think that would be even more complicated...
You don’t need to collect the segments and process them later. Just walk the array once, keeping the largest min of each segment as you go.
Also each_with_index is a waste. Just walk the array yourself taking successive slices.
Simple example:
# initial conditions
space = [1, 2, 3, 1, 2]
x = 2
# here we go
min = space.min
(0..space.length - x).each do |i|
  min = [space[i, x].min, min].max
end
puts min
As usual in Ruby, there's always a more elegant way. For example, there's no need to create the slices yourself; let Ruby do it (each_cons):
# initial conditions
space = [1, 2, 3, 1, 2]
x = 2
# here we go
min = space.min
space.each_cons(x) do |slice|
  min = [slice.min, min].max
end
puts min
And of course once you understand how this works, it can be refined even further (e.g. as shown in the comment below).
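The comment itself isn't reproduced here, but one natural refinement in that spirit (my guess, not necessarily what the comment said) chains the enumerators into a one-liner:

# initial conditions
space = [1, 2, 3, 1, 2]
x = 2

puts space.each_cons(x).map(&:min).max # => 2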

Count elements in 1st array less than or equal to elements in 2nd array - python

I have an array A of 21381120 elements ranging over [0,1]. I need to construct a new array B in which element i contains the number of elements in A less than or equal to A[i].
My attempt:
A = np.random.random(10000) # for reproducibility
g = np.sort(A)
B = [np.sum(g<=element) for element in A]
I am still using a for loop, which takes too much time. Since I have to do this several times, I was wondering if a better way to do it exists.
EDIT
I gave an example of the array A for reproducibility. This does what is expected, but I need it to be faster (for arrays having on the order of 2e9 elements).
For instance if:
A = [0.1,0.01,0.3,0.5,1]
I expect the output to be
B = [2, 1, 3, 4, 5]
You could use binary search to speed up searching in a sorted array (see numpy.searchsorted). Note side='right', which counts the elements less than or equal to each value; the default side='left' would count only the strictly smaller ones:
A = np.random.rand(10000) # for reproducibility
g = np.sort(A)
B = np.searchsorted(g, A, side='right') # vectorized, no Python-level loop
Looks like sorting is the way to go, because in a sorted array A, the number of elements less than or equal to A[i] is almost i + 1.
However, if an element is repeated, you'll have to look at the rightmost element that's equal to A[i]:
A = [1,2,3,4,4,4,5,6]
^^^^^ A[3] == A[4] == A[5]
Here, the number of elements <= A[3] is 3 + <number of repeated 4's>. Maybe you could roll your own sorting algorithm that keeps track of such repetitions, or count the repetitions before sorting the array.
Then the final formula would be:
N(<= A[k]) = k + <number of elements equal to A[k]>
where k is the index of the first occurrence of that value in the sorted array.
So the speed of your code would mainly depend on the speed of the sorting algorithm.
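That counting idea is also available directly in numpy, without rolling your own sort. A small sketch of the same approach (my own illustration, using the example from the question): cumulative counts over the sorted unique values give N(<= v) for every distinct value v in one pass.

import numpy as np

A = np.array([0.1, 0.01, 0.3, 0.5, 1.0])
values, inverse, counts = np.unique(A, return_inverse=True, return_counts=True)
cum = np.cumsum(counts)  # cum[r] = number of elements <= values[r]
B = cum[inverse]         # map each element back to its count
print(B)                 # [2 1 3 4 5]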

You have an array of integers, and for each index you want to find the product of every integer except the integer at that index

I was going over some interview questions and came across this one on a website. I have come up with a solution in Ruby, and I wish to know if it is an efficient and acceptable solution. I have been coding for a while now, but I never concentrated on the complexity of a solution before. Now I am trying to learn to minimize time and space complexity.
Question:
You have an array of integers, and for each index you want to find the product of every integer except the integer at that index.
Example:
arr = [1,2,4,5]
result = [40, 20, 10, 8]
# result = [2*4*5, 1*4*5, 1*2*5, 1*2*4]
With that in mind, I came up with this solution.
Solution:
def find_products(input_array)
  product_array = []
  input_array.length.times do
    a = input_array.shift
    product_array << input_array.inject(:*)
    input_array << a
  end
  product_array
end
arr = find_products([1,7,98,4])
From what I understand, I am traversing the array as many times as its length, which is considered terrible in terms of efficiency and speed. I am still unsure what the complexity of my solution is.
Any help in making it more efficient is appreciated, and if you can also tell me the complexity of my solution and how to calculate it, that would be even better.
Thanks
def product_of_others(arr)
  case arr.count(0)
  when 0
    total = arr.reduce(1, :*)
    arr.map { |n| total / n }
  when 1
    ndx_of_0 = arr.index(0)
    arr.map.with_index do |n, i|
      if i == ndx_of_0
        arr[0, ndx_of_0].reduce(1, :*) * arr[ndx_of_0 + 1..-1].reduce(1, :*)
      else
        0
      end
    end
  else
    arr.map { 0 }
  end
end
product_of_others [1,2,4,5] #=> [40, 20, 10, 8]
product_of_others [1,-2,0,5] #=> [0, 0, -10, 0]
product_of_others [0,-2,4,5] #=> [-40, 0, 0, 0]
product_of_others [1,-2,4,0] #=> [0, 0, 0, -8]
product_of_others [1,0,4,0] #=> [0, 0, 0, 0]
product_of_others [] #=> []
For the case where arr contains no zeroes I used arr.reduce(1,:*) rather than arr.reduce(:*) in case the array is empty. Similarly, if arr contains one zero, I used .reduce(1,:*) in case the zero was at the beginning or end of the array.
For inputs not containing zeros (for others, see below)
The easiest (and relatively efficient) way, to me, seems to be to first get the total product:
total_product = array.inject(1){|product, number| product * number}
And then map each array element to the total_product divided by the element:
result = array.map {|number| total_product / number}
After initial calculation of total_product = 1*2*4*5 this will calculate
result = [40/1, 40/2, 40/4, 40/5]
As far as I remember this sums up to O(n) [creating the total product: touch each number once] + O(n) [creating one result per number: touch each number once], i.e. O(n) overall. (Correct me if I am wrong.)
Update
As #hirolau and #CarySwoveland pointed out, there is a problem if you have (exactly 1) 0 in the input, thus:
For inputs containing zeros (workaroundish, but it keeps the performance benefit and complexity class)
zero_count = array.count { |number| number == 0 }
if zero_count == 0
  # as before
elsif zero_count == 1
  # one zero in input, result will have only 1 non-zero entry
  nonzero_array = array.reject { |n| n == 0 }
  total_product = nonzero_array.inject(1) { |product, number| product * number }
  result = array.map do |number|
    (number == 0) ? total_product : 0
  end
else
  # more than one zero? All products will be zero!
  result = array.map { |_| 0 }
end
Sorry that this answer by now basically equals #CarySwoveland's, but I think my code is a bit more explicit.
Look at the comments for further performance considerations.
Here is how I would do it:
arr = [1, 2, 4, 5]
result = arr.map do |x|
  new_array = arr.dup                # Create a copy of the original array
  new_array.delete_at(arr.index(x))  # Remove one instance of the current value
  new_array.inject(:*)               # Return the product
end
p result # => [40, 20, 10, 8]
I don't know Ruby, but accessing an array element is O(1), i.e. constant time. Keep in mind, though, that each of the n iterations of your loop runs inject over n-1 elements, so your algorithm is O(n^2) overall, not O(n); the division-based answers above show how to get to O(n).
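For completeness, a common O(n) alternative that none of the answers above posted (so treat this as an editorial addition, not part of the original thread): prefix/suffix products avoid division entirely and handle zeros without special cases.

def products_via_prefix_suffix(arr)
  n = arr.length
  prefix = Array.new(n, 1)
  suffix = Array.new(n, 1)
  # prefix[i] = product of everything left of i
  (1...n).each { |i| prefix[i] = prefix[i - 1] * arr[i - 1] }
  # suffix[i] = product of everything right of i
  (n - 2).downto(0) { |i| suffix[i] = suffix[i + 1] * arr[i + 1] }
  prefix.zip(suffix).map { |p, s| p * s }
end

p products_via_prefix_suffix([1, 2, 4, 5])  #=> [40, 20, 10, 8]
p products_via_prefix_suffix([1, -2, 0, 5]) #=> [0, 0, -10, 0]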
