Checking if two substrings overlap in O(n) time - arrays

I have a string S of length n, and a list of tuples (a,b), where a specifies the starting position of a substring of S and b is its length. To check whether any substrings overlap, we can, for example, mark each position in S whenever it is touched. However, I think this will take O(n^2) time if the list of tuples has size n (looping over the tuple list, then looping over S).
Is it possible to check whether any substring overlaps with another in O(n) time?
Edit:
For example, S = "abcde" and Tuples = [(1,2),(3,3),(4,2)], representing "ab", "cde" and "de". I want to know that an overlap is discovered when (4,2) is read.
I was thinking it is O(n^2) because for every tuple you get, you need to loop through its substring in S to see whether any character is marked dirty.
Edit 2:
I cannot exit once a collision is detected. Imagine I need to report all the subsequent tuples that collide, so I have to loop through the whole tuple list.
Edit 3:
A high level view of the algorithm:
for each tuple (a,b)
    for (int i = a; i < a + b; i++)   // positions a .. a+b-1
        if S[i] is dirty
            then report tuple and break   // break inner loop only
        else
            mark S[i] dirty

Your basic approach is correct, but you could optimize your stopping condition, in a way that guarantees bounded complexity in the worst case. Think about it this way - how many positions in S would you have to traverse and mark in the worst case?
If there is no collision, then at worst you'll visit length(S) positions (and run out of tuples by then, since any additional tuple would have to collide). If there is a collision, you can stop at the first marked position, so again you're bounded by the maximum number of unmarked elements, which is length(S).
EDIT: since you added a requirement to report all colliding tuples, let's calculate this again (extending my comment):
This time, each step would either mark an unmarked element (overall n in the worst case), or identify a colliding tuple (worst O(tuples), which we assume is also n).
Once you've marked all elements, you can detect a collision for every further tuple with a single step (O(1)), and therefore you would need O(n + n) = O(n).
The actual steps may be interleaved, since the tuples may be organized in any way without colliding first, but once they do (after at most n tuples which cover all n elements before colliding for the first time), you have to collide every time on the first step. Other arrangements may collide earlier, even before marking all elements, but again - you're just rearranging the same number of steps.
Worst case example: one tuple covering the entire array, then n-1 tuples (doesn't matter which) -
[(1,n), (n,1), (n-1,1), ...(1,1)]
The first tuple would take n steps to mark all elements, and the rest would take O(1) each to finish. Overall O(2n) = O(n). Now convince yourself that the following example takes the same number of steps -
[(1,n/2-1), (1,1), (2,1), (3,1), (n/2,n/2), (4,1), (5,1) ...(n,1)]
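Here is a minimal Python sketch of the mark-and-stop approach described above, using 1-based positions as in the question (the function name and signature are just for illustration):

def report_overlapping_tuples(n, tuples):
    # Mark touched positions of S; report every tuple that touches an
    # already-marked position. Runs in O(n + number of tuples).
    dirty = [False] * (n + 1)          # dirty[p] is True once position p has been touched
    overlapping = []
    for (a, b) in tuples:
        for i in range(a, a + b):      # positions a .. a+b-1
            if dirty[i]:
                overlapping.append((a, b))
                break                  # stop scanning this tuple at the first marked position
            dirty[i] = True
    return overlapping

print(report_overlapping_tuples(5, [(1, 2), (3, 3), (4, 2)]))   # [(4, 2)]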

According to your description and comments, the overlap problem is not really a string-algorithm problem; it can be regarded as a "segment overlap" problem.
Using your example, it can be translated into 3 segments: [1, 2], [3, 5], [4, 5]. The question is to check whether any of the 3 segments overlap.
Suppose we have m segments, each with the format [start, end], meaning the segment's start and end positions. One efficient algorithm to detect overlap is to sort them by start position in ascending order, which takes O(m * lg m). Then iterate over the sorted m segments; for each segment you only need to check:
if (start[i] <= maxEnd[i-1]) {   // maxEnd[i-1] = max(end[j]) for 1 <= j <= i-1
    // segment i overlaps an earlier segment
}
maxEnd[i] = max(maxEnd[i-1], end[i]);   // update max end position over segments 1..i
Each check operation takes O(1), so the total time complexity is O(m * lg m + m), which can be regarded as O(m * lg m). The time spent producing each output, though, is related to each tuple's length, which is also related to n.
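A short Python sketch of this sort-and-sweep check (segments are (start, end) pairs with inclusive ends, as above; the function name is illustrative):

def find_overlapping_segments(segments):
    # segments: list of (start, end) pairs with inclusive ends.
    # Returns the segments that overlap an earlier one. O(m log m) sort + O(m) sweep.
    overlapping = []
    max_end = float('-inf')                    # max end position seen so far
    for start, end in sorted(segments):        # sorted by start position
        if start <= max_end:
            overlapping.append((start, end))
        max_end = max(max_end, end)
    return overlapping

# The example tuples (1,2), (3,3), (4,2) translated to segments:
print(find_overlapping_segments([(1, 2), (3, 5), (4, 5)]))   # [(4, 5)]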

This is a segment overlap problem, and the solution is possible in O(n) if the list of tuples has already been sorted in ascending order by the first field. Consider the following approach:
1. Transform the intervals from (start, number of characters) to (start, inclusive_end). Hence the above example becomes: [(1,2),(3,3),(4,2)] ==> [(1, 2), (3, 5), (4, 5)]
2. The tuples are valid if every pair of consecutive transformed tuples (a, b) and (c, d) satisfies b < c. Otherwise there is an overlap among the tuples.
Each of steps 1 and 2 can be done in O(n) if the array is sorted in the form mentioned above.
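A minimal sketch of that check in Python, assuming the (start, length) tuples are already sorted by start (the function name is illustrative):

def has_overlap_sorted(tuples):
    # tuples: (start, length) pairs sorted by start. True if any two substrings overlap. O(n).
    intervals = [(a, a + b - 1) for a, b in tuples]       # (start, inclusive_end)
    return any(intervals[i][1] >= intervals[i + 1][0]     # previous end reaches next start
               for i in range(len(intervals) - 1))

print(has_overlap_sorted([(1, 2), (3, 3), (4, 2)]))       # True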


Sum over n-tuples with total sum equal to k

I want to sum over tuples of length n, i.e. I have a vector (m_1, ..., m_n) where each m_i is an integer greater than or equal to zero, with the constraint that the sum of all vector elements is equal to k.
What is the most efficient way to implement this?
My naive approach would be to iterate through all combinations with each m_i between 0 and k and check whether they satisfy the criterion, but this seems inefficient.
For instance, if k=2 and n=2, then
(2,0), (1,1), (0,2) would be the possible values of (m_1, m_2). Is there a way to generate these numbers efficiently? (I don't necessarily have to store them all in an array, but I want to iterate over all possible combinations.)
If you look at the FXT book/library by J. Arndt, there is, on page 342, section 16.3, "Partition into m parts".
There you'll find the algorithm and a reference to the code that generates exactly the m-vector partitions of n.
You'll probably need to modify it; he doesn't allow bins with zeros, he starts with ones.
And some thoughts on the matter. Here n is the sum and you have k bins (note that this swaps the question's n and k). Start with the |n|0|...|0| combination. Define an operation "distribute 1" (D1), which takes one from the leftmost bin and distributes it into each of the other bins in turn.
E.g. D1(|n|0|...|0|) = tuple(|n-1|1|...|0|, ..., |n-1|0|...|1|)
Then you apply D1() to each element of the resulting tuple and get a tuple of tuples, and so on and so forth, until the first bin is exhausted.
You could think this as a tree:
root |n|0|...|0|
D1 applied once, k-1 leaves |n-1|1|...|0| ... |n-1|0|...|1|
Next tree level, D1 applied once to previous level, each node getting k-1 children.
The only thing left is how to traverse it - DFS, BFS, or anything else from https://en.wikipedia.org/wiki/Tree_traversal
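For completeness, here is a direct recursive enumeration in Python of all n-tuples of nonnegative integers summing to k (using the question's n/k convention). This is a plain recursion rather than the D1 tree above, which can reach the same bin assignment along more than one path:

def compositions(n, k):
    # Yield every n-tuple of nonnegative integers whose elements sum to k.
    if n == 1:
        yield (k,)
        return
    for first in range(k + 1):                  # value of m_1
        for rest in compositions(n - 1, k - first):
            yield (first,) + rest

print(list(compositions(2, 2)))                 # [(0, 2), (1, 1), (2, 0)]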

Efficient way to search within unsorted array

I have an unsorted array A containing values within the range 0 to 100. I have multiple queries of the format QUERY(starting array index, ending array index, startValue, endValue). I want to return the array of indexes whose values lie between startValue and endValue. The naive approach takes O(n) time for each query, and I need a more efficient algorithm. Also, the queries are not known in advance.
There are some tradeoffs in terms of memory usage, preprocessing time and query time. Let h be the range of possible values (101 in your case). Ideally you would like your queries to take O(m) time, where m is the number of indexes returned. Here are some approaches.
2-d trees. Each array element V[x] = y corresponds to a 2-d point (x, y). Each query (start, end, min, max) corresponds to a range query in the 2-d tree between those boundaries. This implementation needs O(n) memory, O(n log n) preprocessing time and O(sqrt n + m) time per query (see the complexity section). Notably, this does not depend on h.
Sorted arrays + min-heap (Arguably an easier implementation if you roll your own).
Build h sorted arrays P_0, ..., P_{h-1}, where P_k is the array of positions at which the value k occurs in the original array. This takes O(n) memory and O(n) preprocessing time.
Now we can answer in O(log n) (using binary search) queries of the form next(pos, k): "starting at position pos, where does the next value of k occur?"
To answer a query (start, end, min, max), begin by collecting next(start, min), next(start, min + 1), ..., next(start, max) and build a min-heap with them. This takes O(h log n) time. Then, while the minimum of the heap is at most end, remove it from the heap, add it to the list of indices to return, and add in its place the next element from its corresponding P array. This yields a complexity of O(h log n + m log h) per query.
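A Python sketch of this sorted-arrays + min-heap scheme (the names preprocess and query are illustrative; h = 101 for values 0..100):

import bisect
import heapq

def preprocess(V, h=101):
    # positions[k] = sorted list of indices where value k occurs in V
    positions = [[] for _ in range(h)]
    for i, v in enumerate(V):
        positions[v].append(i)
    return positions

def query(positions, start, end, lo, hi):
    # Return all indices i with start <= i <= end and lo <= V[i] <= hi.
    heap = []
    for k in range(lo, hi + 1):
        P = positions[k]
        j = bisect.bisect_left(P, start)       # next occurrence of value k at or after `start`
        if j < len(P):
            heap.append((P[j], k, j))
    heapq.heapify(heap)
    result = []
    while heap and heap[0][0] <= end:
        idx, k, j = heapq.heappop(heap)
        result.append(idx)
        if j + 1 < len(positions[k]):          # advance within P_k and push its next position
            heapq.heappush(heap, (positions[k][j + 1], k, j + 1))
    return result

A = [5, 17, 5, 99, 42, 17, 5]
pos = preprocess(A)
print(query(pos, 1, 5, 5, 20))                 # [1, 2, 5] (values 17, 5, 17)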
I have two more ideas based on the linearithmic approach to range minimum queries, but they require O(nh) and O(nh log h) space respectively. The query time is improved to O(m). If that is not prohibitive, please let me know and I will edit the answer to elaborate.

The best order to choose elements in the random array to maximize output?

We have an array as input:
R = [5, 2, 8, 3, 6, 9]
If the ith element is chosen, the output is the sum of the ith element, the maximum element whose index is less than i, and the minimum element whose index is greater than i.
For example, if I take 8, the output would be 8 + 5 + 3 = 16.
Selected items cannot be selected again. So, if I select 8, the array for the next selection would be R = [5, 2, 3, 6, 9].
In what order should all inputs be chosen to maximize the total output? If possible, please suggest dynamic programming solutions.
I'll start the bidding with an O(n·2^n) solution . . .
There are a number of ambiguities in your description of the problem, that you have declined to address in comments. None of these ambiguities affects the runtime complexity of this solution, but they do affect implementation details of the solution, so the solution is necessarily somewhat of a sketch.
The solution is as follows:
Create an array results of 2n integers. Each array index i will denote a certain subsequence of the input, and results[i] will be the greatest sum that we can achieve starting with that subsequence.
A convenient way to manage the index-to-subsequence mapping is to represent the first element of the input using the least significant bit (the 1's place), the second element with the 2's place, etc.; so, for example, if our input is [5, 2, 8, 3, 6, 9], then the subsequence 5 2 8 would be represented as array index 000111_2 = 7, meaning results[7]. (You can also start with the most significant bit - which is probably more intuitive - but then the implementation of that mapping is a little bit less convenient. Up to you.)
Then proceed in order, from subset #0 (the empty subset) up through subset #2^n - 1 (the full input), calculating each array-element by seeing how much we get if we select each possible element and add the corresponding previously-stored values. So, for example, to calculate results[7] (for the subsequence 5 2 8), we select the largest of these values:
results[6] plus how much we get if we select the 5
results[5] plus how much we get if we select the 2
results[3] plus how much we get if we select the 8
Now, it might seem like it should require O(n2) time to compute any given array-element, since there are n elements in the input that we could potentially select, and seeing how much we get if we do so requires examining all other elements (to find the maximum among prior elements and the minimum among later elements). However, we can actually do it in just O(n) time by first making a pass from right to left to record the minimal value that is later than each element of the input, and then proceeding from left to right to try each possible value. (Two O(n) passes add up to O(n).)
An important caveat: I suspect that the correct solution only ever involves, at each step, selecting either the rightmost or second-to-rightmost element. If so, then the above solution calculates many, many more values than an algorithm that took that into account. For example, the result at index 111000_2 is clearly not relevant in that case. But I can't prove this suspicion, so I present the above O(n·2^n) solution as the fastest solution whose correctness I'm certain of.
(I'm assuming that the elements are nonnegative absent a suggestion to the contrary.)
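Here is a sketch of that bitmask DP in Python. It resolves the ambiguities mentioned above by treating a missing left/right neighbour as contributing 0 (that choice, and the function name, are assumptions rather than part of the original problem statement):

def max_total(R):
    n = len(R)
    results = [0] * (1 << n)          # results[mask] = best total starting from the remaining set `mask`
    for mask in range(1, 1 << n):
        # min of the remaining elements strictly to the right of each index (0 if none)
        suffix_min = [0] * n
        cur = None
        for j in range(n - 1, -1, -1):
            suffix_min[j] = cur if cur is not None else 0
            if mask >> j & 1:
                cur = R[j] if cur is None else min(cur, R[j])
        best = 0
        prefix_max = 0                # max of the remaining elements strictly to the left (0 if none)
        for i in range(n):
            if mask >> i & 1:
                gain = R[i] + prefix_max + suffix_min[i]
                best = max(best, gain + results[mask ^ (1 << i)])
                prefix_max = max(prefix_max, R[i])
        results[mask] = best
    return results[(1 << n) - 1]

print(max_total([5, 2, 8, 3, 6, 9]))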
Here's an O(n^2)-time algorithm based on ruakh's conjecture that there exists an optimal solution where every selection is from the rightmost two, which I prove below.
The states of the DP are (1) n, the number of elements remaining (2) k, the index of the rightmost element. We have a recurrence
OPT(n, k) = max(max(R(0), ..., R(n - 2)) + R(n - 1) + R(k) + OPT(n - 1, k),
max(R(0), ..., R(n - 1)) + R(k) + OPT(n - 1, n - 1)),
where the first line is when we take the second rightmost element, and the second line is when we take the rightmost. The empty max is zero. The base cases are
OPT(1, k) = R(k)
for all k.
Proof: the condition of choosing from the two rightmost elements is equivalent to the restriction that the element at index i (counting from zero) can be chosen only when at most i + 2 elements remain. We show by induction that there exists an optimal solution satisfying this condition for all i < j where j is the induction variable.
The base case is trivial, since every optimal solution satisfies the vacuous restriction for j = 0. In the inductive case, assume that there exists an optimal solution satisfying the restriction for all i < j. If j is chosen when there are more than j + 2 elements left, let's consider what happens if we defer that choice until there are exactly j + 2 elements left. None of the elements left of j are chosen in this interval by the inductive hypothesis, so they are irrelevant. Choosing the elements right of j can only be at least as profitable, since including j cannot decrease the max. Meanwhile, the set of elements left of j is the same at both times, and the set of the elements right of j is a subset at the later time as compared to the earlier time, so the min does not decrease. We conclude that this deferral does not affect the profitability of the solution.
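A memoized Python sketch of this DP under the conjecture. The state here is indexed slightly differently from the recurrence above: p is the length of the remaining prefix, so the remaining set is R[0..p-1] plus R[k] with k >= p, and a missing neighbour contributes 0 as in the previous answer:

from functools import lru_cache

def max_total_rightmost(R):
    n = len(R)
    prefix_max = [0] * (n + 1)         # prefix_max[p] = max(R[0..p-1]); 0 for the empty prefix
    for i in range(n):
        prefix_max[i + 1] = max(prefix_max[i], R[i])

    @lru_cache(maxsize=None)
    def opt(p, k):
        # remaining elements: R[0..p-1] plus R[k], with k >= p
        if p == 0:
            return R[k]                # only R[k] remains; no neighbour on either side
        # take the second rightmost, R[p-1]: left max over R[0..p-2], right min is R[k]
        take_second = R[p - 1] + prefix_max[p - 1] + R[k] + opt(p - 1, k)
        # take the rightmost, R[k]: left max over R[0..p-1], nothing to its right
        take_right = R[k] + prefix_max[p] + opt(p - 1, p - 1)
        return max(take_second, take_right)

    return opt(n - 1, n - 1)

print(max_total_rightmost([5, 2, 8, 3, 6, 9]))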

Sort an array so the difference of elements a[i]-a[i+1]<=a[i+1]-a[i+2]

My mind has been blown since I began, last week, trying to sort an array of N elements under the condition that the difference between two consecutive elements is always less than or equal to the difference between the next two. For example:
A[4] = { 10, 2, 7, 4 }
It is possible to rearrange that array this way:
{2, 7, 10, 4} because (2 - 7 = -5) < (7 - 10 = -3) < (10 - 4 = 6)
{4, 10, 7, 2} because (4 - 10 = -6) < (10 - 7 = 3) < (7 - 2 = 5)
One solution I considered was simply shuffling the array and checking each time whether it satisfied the conditions - an efficient method for a small number of elements, but time-consuming or even infeasible for a larger number of elements.
Another was trying to move elements around the array with loops, hoping again to meet the requirements, but this method is also very time-consuming and sometimes does not succeed.
I haven't managed to find an algorithm, but there must be something.
Thank you very much in advance.
I normally don't just provide code, but this question intrigued me, so here's a brute-force solution, that might get you started.
The concept will always be slow because the individual elements in the list to be sorted are not independent of each other, so they cannot be sorted using traditional O(N log N) algorithms. However, the differences can be sorted that way, which simplifies checking for a solution, and permutations could be checked in parallel to speed up the processing.
import itertools

def is_diff_sorted(qa):
    # differences qa[i] - qa[i+1] must be non-decreasing
    diffs = [qa[i] - qa[i+1] for i in range(len(qa)-1)]
    for i in range(len(diffs)-1):
        if diffs[i] > diffs[i+1]:
            return False
    return True

a = [2, 4, 7, 10]
#a = [1, 4, 6, 7, 20]
a.sort()
for perm in itertools.permutations(a):
    if is_diff_sorted(perm):
        print("Solution:", perm)
        break
This condition is related to differentiation. The (negative) difference between neighbouring elements has to be steady or increasing with increasing index. Multiply the condition by -1 and you get
a[i+1] - a[i] >= a[i+2] - a[i+1]
or
0 >= (a[i+2] - a[i+1]) - (a[i+1] - a[i])
So the 2nd derivative has to be 0 or negative, which is the same as having the first derivative stay the same or change downwards, like e.g. portions of the upper half of a circle. That does not mean that the first derivative itself has to start out positive or negative, just that it never changes upward.
The problem algorithmically is that it can't be a simple sort, since you never compare just 2 elements of the list; you have to compare three at a time (i, i+1, i+2).
So the only thing you know apart from random permutations is given in Klas' answer (values first rising, if at all, then falling, if at all), but his is not a sufficient condition, since you can still have a positive 2nd derivative within his two sets (rising/falling).
So is there a solution much faster than the random shuffle? I can only think of the following argument (similar to Klas' answer). For a given vector, a solution is more likely if you separate the data into a 1st segment that is rising or steady (not falling) and a 2nd segment that is falling or steady (not rising), and neither is empty. Likely an argument could be made that the two segments should have approximately equal size. The rising segment should contain the data that are closer together and the falling segment should contain the data that are further apart. So one could start with the mean, look for data that are close to it, move them to the first set, then look for more widely spaced data and move them to the 2nd set. So a histogram might help.
[4 7 10 2] --> diff [ 3 3 -8] --> 2diff [ 0 -11]
Here is a solution based on a backtracking algorithm.
1. Sort the input array in non-increasing order.
2. Start dividing the array's values into two subsets: put the largest element into both subsets (this will be the "middle" element), then place the second largest into an arbitrary subset.
3. Sequentially put the remaining elements into either subset. If this cannot be done without violating the "difference" condition, use the other subset. If neither subset is acceptable, roll back and change preceding decisions.
4. Reverse one of the arrays produced in step 3 and concatenate it with the other array.
Below is a Python implementation (it is not perfect; the worst defect is the recursive implementation: while recursion is quite common for backtracking algorithms, this particular algorithm seems to work in linear time, and recursion is not good for very large input arrays).
def is_concave_end(a, x):
    # Can x be appended to a while keeping the differences non-decreasing?
    return a[-2] - a[-1] <= a[-1] - x

def append_element(sa, halves, labels, which, x):
    labels.append(which)
    halves[which].append(x)
    if len(labels) == len(sa) or split_to_halves(sa, halves, labels):
        return True
    if which == 1 or not is_concave_end(halves[1], halves[0][-1]):
        halves[which].pop()
        labels.pop()
        return False
    # backtrack: move the element from the first half to the second half
    labels[-1] = 1
    halves[1].append(halves[0][-1])
    halves[0].pop()
    if split_to_halves(sa, halves, labels):
        return True
    halves[1].pop()
    labels.pop()

def split_to_halves(sa, halves, labels):
    x = sa[len(labels)]
    if len(halves[0]) < 2 or is_concave_end(halves[0], x):
        return append_element(sa, halves, labels, 0, x)
    if is_concave_end(halves[1], x):
        return append_element(sa, halves, labels, 1, x)

def make_concave(a):
    sa = sorted(a, reverse=True)
    halves = [[sa[0]], [sa[0], sa[1]]]
    labels = [0, 1]
    if split_to_halves(sa, halves, labels):
        return list(reversed(halves[1][1:])) + halves[0]

print(make_concave([10, 2, 7, 4]))
It is not easy to produce a good data set to test this algorithm: a plain set of random numbers is either too simple for this algorithm or has no solutions. Here I tried to generate a set that is "difficult enough" by mixing together two sorted lists, each satisfying the "difference" condition. Still, this data set is processed in linear time. And I have no idea how to prepare any data set that would demonstrate more-than-linear time complexity for this algorithm...
Note that since the difference should be ever-rising, any solution will have elements first in rising order and then in falling order. The length of either of the two "suborders" may be 0, so a solution could consist of a strictly rising or strictly falling sequence.
The following algorithm will find any solutions:
Divide the set into two sets, A and B. Empty sets are allowed.
Sort A in rising order and B in falling order.
Concatenate the two sorted sets: AB
Check if you have a solution.
Do this for all possible divisions into A and B.
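A brute-force sketch of this division approach in Python (it checks every division, so it is exponential, but it directly mirrors the steps above; the helper names are illustrative):

from itertools import combinations

def diffs_non_decreasing(seq):
    # check a[i] - a[i+1] <= a[i+1] - a[i+2] for all i
    d = [seq[i] - seq[i + 1] for i in range(len(seq) - 1)]
    return all(d[i] <= d[i + 1] for i in range(len(d) - 1))

def solve_by_splitting(values):
    values = sorted(values)
    n = len(values)
    for r in range(n + 1):
        for picked in combinations(range(n), r):      # indices that go into the rising set A
            picked_set = set(picked)
            A = [values[i] for i in range(n) if i in picked_set]              # rising
            B = [values[i] for i in range(n) if i not in picked_set][::-1]    # falling
            candidate = A + B
            if diffs_non_decreasing(candidate):
                return candidate
    return None

print(solve_by_splitting([10, 2, 7, 4]))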
Expanding on @roadrunner66's analysis: the solution is to take the two smallest elements of the original array and make them first and last in the target array; take the two next smallest elements and make them second and next-to-last; keep going until all the elements are placed into the target. Notice that which one goes to the left and which one to the right doesn't matter.
Sorting the original array facilitates the process (finding smallest elements becomes trivial), so the time complexity is O(n log n). The space complexity is O(n), because it requires a target array. I don't know off-hand if it is possible to do it in-place.
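A minimal sketch of that construction (helper names are just for illustration). Since the answer doesn't include a proof, the sketch also verifies the result against the difference condition:

def diffs_non_decreasing(seq):
    d = [seq[i] - seq[i + 1] for i in range(len(seq) - 1)]
    return all(d[i] <= d[i + 1] for i in range(len(d) - 1))

def pair_ends(values):
    s = sorted(values)
    out = [None] * len(s)
    lo, hi = 0, len(s) - 1
    for i, x in enumerate(s):          # 1st and 2nd smallest to the ends, 3rd and 4th just inside, ...
        if i % 2 == 0:
            out[lo] = x
            lo += 1
        else:
            out[hi] = x
            hi -= 1
    return out

candidate = pair_ends([10, 2, 7, 4])
print(candidate, diffs_non_decreasing(candidate))   # [2, 7, 10, 4] True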

How to locate in a huge list of numbers, two numbers where xi=xj?

I have the following question, and it screams at me for a solution with hashing:
Problem :
Given a huge list of numbers x_1, ..., x_n, where each x_i <= T, we'd like to know
whether there exist two indices i, j such that x_i == x_j.
Find an algorithm for the problem that runs in O(n) time, in expectation.
My solution at the moment: we use hashing, with a hash function h(x) and chaining.
First, we build a new array, let's call it A, where each cell is a linked list - this is the destination array.
Now we run over all n numbers and map each of x_1, ..., x_n to its rightful place using the hash function. This takes O(n) run time.
After that we run over A and look for collisions. If we find a cell where length(A[k]) > 1,
then we return the x_i and x_j that were mapped to the value stored in A[k]. The total run time here is O(n) in the worst case, e.g. if the mapped values of the two numbers (if they indeed exist) land in the last cell of A.
The same approach can be made ~twice as fast (on average), still O(n) on average - but with better constants.
There is no need to map all the elements into the hash table and then go over it - a faster solution would be:
for each element e:
    if e is in the table:
        return e
    else:
        insert e into the table
Also note that if T < n, there must be a duplicate within the first T+1 elements, by the pigeonhole principle.
Also, for small T you can use a simple array of size T; no hash is needed (hash(x) = x). Initializing the array to contain zeros as initial values can be done in O(1).
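A minimal Python sketch of the insert-and-check loop above, using a built-in set as the hash table:

def find_duplicate(xs):
    # Return a duplicated value from xs, or None if all values are distinct.
    # Expected O(n) time: check and insert each element as we go.
    seen = set()
    for e in xs:
        if e in seen:
            return e
        seen.add(e)
    return None

print(find_duplicate([3, 1, 4, 1, 5, 9]))   # 1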
