Longest Common Subsequence for Multiple Sequences - c

I have done a bunch of research on finding the longest common subsequence for M = 2 sequences, but I am trying to figure out how to do it for M ≥ 2 sequences.
I am given N and M: M sequences, each containing N unique elements drawn from the set {1, ..., N}. I have thought about the dynamic programming approach, but I am still confused as to how to actually apply it.
Example input
5 3
5 3 4 1 2
2 5 4 3 1
5 2 3 1 4
The max sequence here can be seen to be
5 3 1
Expected output
Length = 3

A simple idea.
For each number i between 1 and N, calculate the length of the longest common subsequence whose last number is i. (Let's call it a[i].)
To do that, we iterate over the numbers i in the first sequence from start to end. If a[i] > 1, then there is a number j that comes before i in every sequence.
So we can just check all possible values of j and, whenever that condition holds, do a[i] = max(a[i], a[j] + 1).
Finally, because j comes before i in the first sequence, a[j] has already been calculated by the time we need it.
for each i in first_sequence
    // for the OP's example, 'i' would take values [5, 3, 4, 1, 2], in this order
    a[i] = 1;
    for each j in 1..N
        if j is before i in each sequence
            a[i] = max(a[i], a[j] + 1)
        end
    end
end
The answer is the largest a[i]. It's O(N^2*M) if you precompute a matrix holding the position of every number in every sequence.
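The idea above can be sketched in Python as follows. This is only a minimal sketch, assuming every sequence is a permutation of 1..N as in the question; the function name and the `pos` lookup tables are illustrative choices, not part of the original answer.

```python
def longest_common_subsequence_length(sequences):
    """Length of the longest common subsequence of permutations of 1..N."""
    n = len(sequences[0])
    # pos[s][v] = index of value v in sequence s (the precomputed position matrix)
    pos = [{v: idx for idx, v in enumerate(seq)} for seq in sequences]

    a = [0] * (n + 1)  # a[v] = best length of a common subsequence ending in v
    for i in sequences[0]:
        a[i] = 1
        for j in range(1, n + 1):
            # j can precede i in a common subsequence only if it precedes i everywhere;
            # in that case j was visited earlier, so a[j] is already final
            if all(p[j] < p[i] for p in pos):
                a[i] = max(a[i], a[j] + 1)
    return max(a)


example = [[5, 3, 4, 1, 2], [2, 5, 4, 3, 1], [5, 2, 3, 1, 4]]
print("Length =", longest_common_subsequence_length(example))  # prints "Length = 3"
```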

Since you have unique elements, @Nikita Rybak's answer is the one to go with, but since you mentioned dynamic programming, here's how you'd use DP when you have more than two sequences (shown for three):
dp[i, j, k] = length of the longest common subsequence of the prefixes
              a[1..i], b[1..j], c[1..k]; it is 0 whenever i, j or k is 0.
dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1]                            if a[i] = b[j] = c[k]
            = max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise
To get the actual subsequence back, use a recursive function that starts from dp[a.Length, b.Length, c.Length] and basically reverses the above formulas: if the three elements are equal, backtrack to dp[a.Length - 1, b.Length - 1, c.Length - 1] and print the character. If not, backtrack according to the max of the above values.
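As a rough illustration of the recurrence and the backtracking step for exactly three sequences, here is a small Python sketch. It uses 0-based lists, so `dp[i][j][k]` refers to the first i, j and k elements; the function name `lcs3` is made up for this example.

```python
def lcs3(a, b, c):
    """Return one longest common subsequence of three sequences."""
    la, lb, lc = len(a), len(b), len(c)
    dp = [[[0] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            for k in range(1, lc + 1):
                if a[i - 1] == b[j - 1] == c[k - 1]:
                    dp[i][j][k] = dp[i - 1][j - 1][k - 1] + 1
                else:
                    dp[i][j][k] = max(dp[i - 1][j][k], dp[i][j - 1][k], dp[i][j][k - 1])

    # Backtrack from the full prefixes, reversing the recurrence.
    out = []
    i, j, k = la, lb, lc
    while i and j and k:
        if a[i - 1] == b[j - 1] == c[k - 1]:
            out.append(a[i - 1])
            i, j, k = i - 1, j - 1, k - 1
        elif dp[i - 1][j][k] == dp[i][j][k]:
            i -= 1
        elif dp[i][j - 1][k] == dp[i][j][k]:
            j -= 1
        else:
            k -= 1
    return out[::-1]


print(lcs3([5, 3, 4, 1, 2], [2, 5, 4, 3, 1], [5, 2, 3, 1, 4]))  # prints [5, 3, 1]
```

The table has one dimension per sequence, so time and memory grow as the product of the sequence lengths; that is why this approach only suits a small, fixed number of sequences.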

You can look into "Design of a new Deterministic Algorithm for finding Common DNA Subsequence" paper. You can use this algorithm to construct the DAG (pg 8, figure 5). From the DAG, read the largest common distinct subsequences. Then try a dynamic programming approach on that using the value of M to decide how many DAGs you need to construct per sequence. Basically use these subsequences as key and store the corresponding sequence numbers where it is found and then try to find the largest subsequence (which can be more than 1).

Related

insertion sort theoretical analysis, total number of shifts.

Given the following array:
[14 17 21 34 47 19 71 22 29 41 8]
and the following excerpt from the book Algorithms Unlocked by Thomas Cormen
(slightly edited, [START] and [STOP] flags are not part of the text):
Insertion sort is an excellent choice when the array starts out as
''almost sorted''. [START] Suppose that each array element starts out within
k positions of where it ends up in the sorted array. Then the total
number of times that a given element is shifted over all iterations
of the inner loop is at most k. Therefore, the total number of times
that all elements are shifted over all inner-loop iterations, is at
most kn, which in turn tells us that the total number of inner-loop
iterations is at most kn (since each inner-loop iteration shifts
exactly one element by one position).[STOP] If k is a constant, then the
total running time of insertion sort would be only Θ(n), because the
Θ-notation subsumes the constant factor k. In fact we can even
tolerate some elements moving a long distance in the array, as long as
there are not too many such elements. In particular, if L elements can
move anywhere in the array (so that each of these elements can move by
up to n-1 positions), and the remaining n - L elements can move at
most k positions, then the total number of shifts is at most L * (n –
1) + (n – L) * k = (k + L) * n – (k + 1) * L, which is Θ(n) if both k
and L are constants.
The book is trying to explain how it arrives at the formula presented at the bottom of the text. I would like some help understanding what it says. A specific example using the sample array above would probably help, so that I can see what is going on with the k and n variables. Can you help me better understand the excerpt's analysis?
To be more specific, what confuses me is the text between the [START] and [STOP] flags, namely these lines:
Suppose that each array element..... which in turn tells us that the
total number of inner-loop iterations is at most kn(since each
inner-loop iteration shifts exactly one element by one position).
(anything below these lines is totally understood all the way to the end.)
Let us consider the insertion sort algorithm
Algorithm: InsertionSort(A)
i ← 1
while i < length(A)
    j ← i
    while j > 0 and A[j-1] > A[j]
        swap A[j] and A[j-1]
        j ← j - 1
    end while
    i ← i + 1
end while
The inner loop moves A[i] left through A[0..i-1], one swap at a time, until it is in its correct position.
Therefore, if a given element is at most k positions away from its correct place, we will have at most k compares and swaps for it. For n elements the total is at most k*n.
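To make k and n concrete for the sample array from the question, the following Python sketch runs the swap-based insertion sort above, counts the swaps, and computes k as the largest distance between an element's starting index and its index in the sorted array. The helper names are invented for this illustration, and the array is assumed to be non-empty.

```python
def insertion_sort_shifts(values):
    """Sort a copy with the swap-based insertion sort and count the swaps."""
    a = list(values)
    shifts = 0
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            shifts += 1
            j -= 1
    return a, shifts


def max_displacement(values):
    """Largest distance between an element's start index and sorted index (k)."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    return max(abs(final - start) for final, start in enumerate(order))


sample = [14, 17, 21, 34, 47, 19, 71, 22, 29, 41, 8]
sorted_sample, shifts = insertion_sort_shifts(sample)
k = max_displacement(sample)
print(shifts, k, k * len(sample))  # prints "21 10 110"
```

Here n = 11 and the 8 has to travel from the last position to the first, so k = 10 and the kn bound of 110 is very loose compared with the 21 swaps actually performed. This is the situation the book's second formula, with L far-moving elements, is meant to handle.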
Hope it helps!

Shortest unique subsequence that distinguish a set of strings

I have N strings of bits (each of size M) X1[0..M], ..., XN[0..M]. I need the pseudocode/algorithm to find the smallest length subsequence (not necessarily contiguous) that is unique in each given string. For example,
If the strings are 011011, 011000, 010010 , the subsequence [2,4] (11, 10, 01) is different in each string. Or the subsequence [2, 4, 5] (111, 100, 010) . Or the subsequence [4, 5] (11, 00, 10).
But not the subsequence [0, 1, 5] (011, 010, 010) ---> Not unique in each string.
EDIT : 1 <= M <= 1000, 2 <= N <= 10.
EDIT : Currently, my solution is this :
The minimum subsequence length lies between ceil(log2(N)) and N-1.
So, the pseudocode will be:
for i = ceil(log2(N)) to N-1:
    check all subsequences of size i
    if any subsequence distinguishes all N strings, return i
The first step can be done by generating all C(M, i) combinations of positions.
The second step can be done by extracting the subsequence from all N strings and checking that all of them are distinct.
But this algorithm currently has exponential complexity. I wanted to know if a better algorithm is possible.
EDIT : No, It isn't homework. It was asked in an interview.
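The brute-force search described in the question could look roughly like this in Python. It is only a sketch of that exhaustive approach, not a faster algorithm; the function name is made up, and as a safety measure the loop runs up to M rather than stopping at N-1, returning None when no set of positions separates the strings (for example when two strings are identical).

```python
from itertools import combinations


def shortest_distinguishing_positions(strings):
    """Smallest tuple of positions whose bits differ between all strings."""
    n, m = len(strings), len(strings[0])
    start = max(1, (n - 1).bit_length())  # ceil(log2(n)) without floating point
    for size in range(start, m + 1):
        for positions in combinations(range(m), size):
            projected = {tuple(s[p] for p in positions) for s in strings}
            if len(projected) == n:  # every string maps to a different value
                return positions
    return None


print(shortest_distinguishing_positions(["011011", "011000", "010010"]))  # prints (2, 4)
```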
I think something like this would work:
First:
create matrix A (m x m) and array B(m)
for each bit i from right to left, compute the decimal value of word j in A[i][j]
// that means A[i][j] holds the decimal value of word j up to bit i
in the same loop, B[i] records whether bit i is the same in all words.
if B[i] = true, we don't need to look at that position, because it tells us nothing.
create deque D // to check whether there are equal elements
create array C(m)
for each position P in [0...M] where B[P] = false:
    for each bit i = P ... 0
        for each word j
            C[j] = C[j]*2 + word[j][i] // word[j][i] = bit i of word j
        bool finished = true;
        for each e in C {
            if (D.count(e) > 0) {
                finished = false;
                break;
            }
            else {
                D.add(e)
            }
        }
        if (finished) return range(P...i);
        D.clear()
not possible;
What this algorithm does: starting from each useful position, it builds up a value for every word from the bits at and before that position, and as soon as all of the values can be added to the deque (that is, all of them are different), you have found a range in which the words differ (the range has size P - i + 1).
You have to run this for every position P where B[P] = false, so in the worst case it should take about n³ steps.
Note that some optimizations are possible when you know the number of strings and their size. For example, if there are 10 strings of size 3, you know it is impossible to distinguish them (because there are not 10 different binary strings of size 3). Given the number of strings, you only need to search for sizes (contiguous or not) of at least ceil(log2(number of strings)). For example, 5 words cannot all differ in one bit, nor in 2 bits, but with 3 bits they can.

Optimal way to find number of operation required to convert all K numbers to lie in the range [L,R] (i.e. L≤x≤R)

I am solving this problem, which needs an optimized technique. I can
only think of the brute-force method, which requires enumerating
combinations.
Given an array A consisting of n integers. We call an integer "good"
if it lies in the range [L,R] (i.e. L≤x≤R). We need to make sure if we
pick up any K integers from the array at least one of them should be a
good integer.
For achieving this, in a single operation, we are allowed to
increase/decrease any element of the array by one.
What will be the minimum number of operations we will need for a
fixed k, i.e. for each k = 1 to n?
input:
L R
1 2
A=[ 1 3 3 ]
output:
for k=1 : 2
for k=2 : 1
for k=3 : 0
For k=1, you have to convert both the 3s into 2s to make sure that if
you select any one of the 3 integers, the selected integer is good.
For k=2, one of the possible ways is to convert one of the 3s into 2.
For k=3, no operation is needed as 1 is a good integer.
As burnpanck has explained in his answer, to make sure that any k elements picked from the array include at least one in the range [L,R], there must be at least n - k + 1 numbers in the range [L,R] in the array.
So, first, for each element, we calculate the cost of making it valid (that is, moving it into the range [L,R]) and store those costs in an array cost.
We notice that:
For k = 1, the minimum cost is the sum of the array cost.
For k = 2, the minimum cost is the sum of cost, minus the largest element.
For k = 3, the minimum cost is the sum of cost, minus the two largest elements.
...
So, we need a prefixSum array, whose ith position holds the sum of the sorted cost array from 0 to i.
After calculating prefixSum, we can answer the result for each k in O(1).
So here is the algorithm in Java; notice the time complexity is O(n log n):
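int[] cost = new int[n];
for (int i = 0; i < n; i++)
    // Min cost for element i: its distance to the range [L, R] (0 if already inside);
    // A, L and R are the inputs from the problem statement
    cost[i] = Math.max(0, Math.max(L - A[i], A[i] - R));
Arrays.sort(cost);
int[] prefix = new int[n];
for (int i = 0; i < n; i++)
    prefix[i] = cost[i] + (i > 0 ? prefix[i - 1] : 0);
// k = n - i needs i + 1 good elements, i.e. the i + 1 cheapest costs
for (int i = n - 1; i >= 0; i--)
    System.out.println("Result for k = " + (n - i) + " is " + prefix[i]);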
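For comparison, the same sort-then-prefix-sum idea can be written compactly in Python. This is a sketch under the problem statement's assumptions (integer inputs, one operation changes one element by one); the function and parameter names are chosen here for illustration.

```python
from itertools import accumulate


def min_ops_per_k(low, high, values):
    """Minimum operations for k = 1..n, returned as a list indexed by k - 1."""
    # Cost of an element = its distance to the range [low, high], 0 if inside.
    costs = sorted(max(0, low - v, v - high) for v in values)
    prefix = list(accumulate(costs))  # prefix[i] = sum of the i + 1 cheapest costs
    n = len(values)
    # For a given k we need n - k + 1 good elements, i.e. prefix[n - k].
    return [prefix[n - k] for k in range(1, n + 1)]


print(min_ops_per_k(1, 2, [1, 3, 3]))  # prints [2, 1, 0]
```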
To be sure that picking any k elements gives at least one valid element, you should have no more than k-1 invalid elements in your set. You therefore need to find the cheapest way to make enough elements valid. I would do this as follows: in a single pass, generate a map that counts how many elements in the set need n operations to be made valid. Then, you clearly want to take the elements that need the fewest operations, so take the required number of elements in ascending order of required number of operations, and sum the number of operations.
In python:
def min_ops(L, R, A_set):
    n_ops = dict()  # create an empty mapping
    for a in A_set:  # loop over all a in the set A_set
        n = max(0, max(a - R, L - a))  # the number of operations required to make a valid
        n_ops[n] = n_ops.get(n, 0) + 1  # in the mapping, increment the element keyed by *n* by one. If it does not exist yet, assume it was 0.
    allret = []  # create a new list to hold the result for all k
    for k in range(1, len(A_set) + 1):  # iterate over all k in the range [1,N+1) == [1,N]
        n_good_required = len(A_set) - k + 1
        ret = 0
        # iterate over all pairs of keys,values from the mapping, sorted by key.
        # The key is the number of ops required, the value the number of elements available
        for n, nel in sorted(n_ops.items()):
            if n_good_required <= 0:  # enough valid elements already, stop adding cost
                break
            ret += n * min(nel, n_good_required)
            n_good_required -= nel
        allret.append(ret)  # append the answer for this k to the result list
    return allret
As an example:
A_set = [1,3,3,6,8,5,4,7]
L,R = 4,6
For each A, we find how many operations we need to make it valid:
n = [3,1,1,0,2,0,0,1]
(i.e. 1 needs 3 steps, 3 needs one, and so on)
Then we count them:
n_ops = {
0: 3, # we already have three valid elements
1: 3, # three elements that require one op
2: 1,
3: 1, # and finally one that requires 3 ops
}
Now, for each k, we find out how many valid elements we need in the set,
e.g. for k = 4, we need at most 3 invalid in the set of 8, so we need 5 valid ones.
Thus:
ret = 0
n_good_required = 5
with n=0, we have 3 so take all of them
ret = 0
n_good_required = 2
with n=1, we have 3, but we need just two, so take those
ret = 2
we're finished

Is there an O(n) algorithm to generate a prefix-less array for an positive integer array?

For the array [4,3,5,1,2]:
the prefix of 4 is empty, so the prefix-less value of 4 is 0;
the prefix of 3 is [4], and the prefix-less value of 3 is 0, because nothing in the prefix is less than 3;
the prefix of 5 is [4,3], and the prefix-less value of 5 is 2, because 4 and 3 are both less than 5;
the prefix of 1 is [4,3,5], and the prefix-less value of 1 is 0, because nothing in the prefix is less than 1;
the prefix of 2 is [4,3,5,1], and the prefix-less value of 2 is 1, because only 1 is less than 2.
So for the array [4,3,5,1,2], we get the prefix-less array [0,0,2,0,1].
Is there an O(n) algorithm to compute the prefix-less array?
It can't be done in O(n) using comparisons alone, for the same reason a comparison sort requires Ω(n log n) comparisons. The number of possible prefix-less arrays is n!, so you need at least log2(n!) bits of information to identify the correct prefix-less array, and log2(n!) is Θ(n log n) by Stirling's approximation.
Assuming that the input elements are always fixed-width integers you can use a technique based on radix sort to achieve linear time:
L is the input array
X is the list of indexes of L in focus for the current pass
n is the bit we are currently working on
Count is the number of elements left of the current location whose bit n is 0
Y is the list of indexes of a subsequence of L for recursion
P is a zero-initialized array that is the output (the prefix-less array)
In pseudo-code...
Def PrefixLess(L, X, n)
    if (n == 0)
        return
    // set up prefix-less for bit n: an element whose bit n is 1 is greater
    // than every earlier element in X whose bit n is 0
    Count = 0
    For I in 1 to |X|
        If (L(X(I))[n] == 1)
            P(X(I)) += Count
        Else
            Count++
    // go through the subsequence with bit(n) = 1, on bit n-1
    Y = []
    For I in 1 to |X|
        If (L(X(I))[n] == 1)
            Y.append(X(I))
    PrefixLess(L, Y, n-1)
    // go through the subsequence with bit(n) = 0, on bit n-1
    Y = []
    For I in 1 to |X|
        If (L(X(I))[n] == 0)
            Y.append(X(I))
    PrefixLess(L, Y, n-1)
    return P
and then execute:
PrefixLess(L, 1..|L|, 32)
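A Python rendering of this bit-by-bit recursion might look like the sketch below. It assumes non-negative integers that fit in `bits` bits and numbers the bits from 0, so it starts at bit `bits - 1`; the function and variable names are illustrative. The running time is proportional to n times the bit width, which is linear only when the width is treated as a constant.

```python
def prefix_less(values, bits=32):
    """For each position, count the earlier elements that are strictly smaller."""
    result = [0] * len(values)

    def recurse(indexes, bit):
        # All elements listed in `indexes` agree on every bit above `bit`.
        if bit < 0 or len(indexes) < 2:
            return
        zeros_seen = 0
        ones, zeros = [], []
        for idx in indexes:
            if (values[idx] >> bit) & 1:
                # Every earlier element of the group with a 0 here is smaller.
                result[idx] += zeros_seen
                ones.append(idx)
            else:
                zeros_seen += 1
                zeros.append(idx)
        recurse(ones, bit - 1)
        recurse(zeros, bit - 1)

    recurse(list(range(len(values))), bits - 1)
    return result


print(prefix_less([4, 3, 5, 1, 2]))  # prints [0, 0, 2, 0, 1]
```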
I think this should work, but double-check the details. Let's call an element of the original array a[i] and the corresponding element of the prefix-less array p[i].
So, say we are at a[i] and we have already computed the value of p[i]. There are three possible cases. If a[i] == a[i+1], then p[i] == p[i+1]. If a[i] < a[i+1], then p[i+1] >= p[i] + 1. This leaves us with the case where a[i] > a[i+1]. In this situation we know that p[i+1] <= p[i].
In the naïve case, we go back through the prefix and count the items less than a[i]. However, we can do better than that. First, recognize that the minimum value for p[i] is 0 and the maximum is i. Next, look at an index j, where i > j. If a[i] >= a[j], then p[i] >= p[j]. If a[i] < a[j], then p[i] <= p[j] + (i - j - 1), because only the elements strictly between positions j and i can add to the count. So, we can go backwards through p, updating the bounds p[i]_min and p[i]_max. If p[i]_min equals p[i]_max, then we have our solution.
Doing a back-of-the-envelope analysis of the algorithm, it has O(n) best-case performance. This is the case where the list is already sorted. The worst case is where it is reverse sorted. Then the performance is O(n^2). The average performance is going to be O(k*n), where k is how far one needs to backtrack. My guess is that for randomly distributed integers, k will be small.
I am also pretty sure there would be ways to optimize this algorithm for cases of partially sorted data. I would look at Timsort for some inspiration on how to do this. It uses run detection to detect partially sorted data. So the basic idea for the algorithm would be to go through the list once and look for runs of data. For ascending runs of data you are going to have the case where p[i+1] >= p[i]+1. For descending runs, p[i] <= p_run[0], where p_run[0] is the value for the first element in the run.

Zero sum minimal subarray [duplicate]

Possible Duplicate:
Zero sum SubArray
An array contains both positive and negative elements, find the
subarray whose sum equals 0.
This is an interview question.
Unfortunately, I cannot read the accepted answer to this question, so I am asking it again: how to find the minimal integer subarray with zero sum?
Note, this is not a "zero subset problem". The obvious brute-force solution is O(N^2) (loop over all subarrays). Can we solve it in O(N)?
This algorithm will find them all; you can easily modify it to find the minimal subarray.
Given an int[] input array, you can create an int[] tmp array where tmp[i] = tmp[i - 1] + input[i], so that each element of tmp stores the sum of the input up to that element.
Now if you check tmp, you'll notice that some values may be equal to each other. Say these values are at indexes j and k with j < k; then the subarray with sum 0 runs from index j + 1 to k. NOTE: if j + 1 == k, then input[k] is 0 and that's it! ;)
NOTE: The algorithm should consider a virtual tmp[-1] = 0;
The implementation can be done in different ways, including using a HashMap as suggested by BrokenGlass, but be careful with the special case in the NOTE above.
Example:
int[] input = {4, 6, 3, -9, -5, 1, 3, 0, 2}
int[] tmp = {4, 10, 13, 4, -1, 0, 3, 3, 5}
Note the value 4 in tmp at index 0 and 3 ==> sum of input 1 to 3 = 0, length (3 - 1) + 1 = 3
Note the value 0 in tmp at index 5 ==> sum of input 0 to 5 = 0, length (5 - 0) + 1 = 6
Note the value 3 in tmp at index 6 and 7 ==> sum of input 7 to 7 = 0, length (7 - 7) + 1 = 1
An array contains both positive and negative elements, find the
subarray whose sum equals 0.
Yes, that can be done in O(n). If the sum of the elements within a subarray equals zero, the sum of the elements up to the element just before the subarray is the same as the sum of the elements up to the last element of the subarray.
Go through the array and, for each index K, look up the sum up to K in a hashtable. If that sum already exists, compare the stored index with K: if the distance between them is lower than the minimum subarray length found so far, update the minimum. Then store (sum, current index K) in the hashtable, overwriting any earlier index.
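The hashtable approach can be sketched in Python like this. It is a minimal illustration that assumes the goal is the shortest non-empty zero-sum subarray; it returns inclusive (start, end) indexes, or None if no such subarray exists. The dictionary is seeded with sum 0 at index -1 to play the role of the virtual tmp[-1] = 0 mentioned in the other answer.

```python
def min_zero_sum_subarray(nums):
    """Return (start, end) of the shortest zero-sum subarray, or None."""
    last_index = {0: -1}  # prefix sum -> latest index where it occurred
    best = None
    total = 0
    for k, x in enumerate(nums):
        total += x
        if total in last_index:
            j = last_index[total]  # the elements j + 1 .. k sum to zero
            if best is None or k - j < best[1] - best[0] + 1:
                best = (j + 1, k)
        # Keep only the latest index, since it gives the shortest subarray later.
        last_index[total] = k
    return best


print(min_zero_sum_subarray([4, 6, 3, -9, -5, 1, 3, 0, 2]))  # prints (7, 7)
```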

Resources