Algorithm to check overlap between arrays

I have two arrays, one of existing appointments and one of potential appointments. The arrays each contain from/to values of the existing or potential appointments. The appointments in each array are already sorted by the from time.
I need to check each of the potential appointments against each of the existing appointments to see that there is no overlap. I know I can start from the beginning of the existing appointments each time but I am looking for a more efficient way.

The idea: start by comparing the first intervals to each other. If one interval comes entirely before the other, move on to the next interval until you find one that overlaps or comes after. For any two intervals, either interval A comes entirely before interval B, or B comes entirely before A, or they overlap somehow. Once you find an overlap, you can quit looking. This can easily be made to return the earliest overlapping pair; returning all overlapping pairs requires a little more work.
Pseudocode:

Overlaps(actual[1..n], pending[1..m])
    i = 1
    j = 1
    while i <= n and j <= m do
        if actual[i].stop <= pending[j].start then
            i = i + 1
        else if actual[i].start >= pending[j].stop then
            j = j + 1
        else
            return true
    return false
Note: if you want to find all overlapping pairs instead of quitting after the first overlap, print out i and j, then increment i if actual[i].stop <= pending[j].stop and increment j otherwise. That prints every overlapping pair and is still linear time.
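For concreteness, here is a small Python sketch of both variants (my own translation of the pseudocode, assuming intervals are half-open (start, stop) tuples, each list sorted by start):

def overlaps(actual, pending):
    # Direct translation of the pseudocode above: returns True on the
    # first overlap found, False if none exists.
    i, j = 0, 0
    while i < len(actual) and j < len(pending):
        if actual[i][1] <= pending[j][0]:    # actual[i] ends before pending[j] starts
            i += 1
        elif actual[i][0] >= pending[j][1]:  # actual[i] starts after pending[j] ends
            j += 1
        else:
            return True                      # the intervals overlap
    return False

def all_overlapping_pairs(actual, pending):
    # Variant from the note: collect every overlapping (i, j) pair,
    # advancing whichever interval ends first. Still linear time.
    i, j = 0, 0
    pairs = []
    while i < len(actual) and j < len(pending):
        if actual[i][1] <= pending[j][0]:
            i += 1
        elif actual[i][0] >= pending[j][1]:
            j += 1
        else:
            pairs.append((i, j))
            if actual[i][1] <= pending[j][1]:
                i += 1
            else:
                j += 1
    return pairs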

This can be done efficiently in O(n log n). Consider two arrays A and B containing the existing and potential appointments respectively. Build two sorted arrays from A: A_end, the ending times in increasing order, and A_start, the starting times in increasing order. This takes O(n log n) time.
For each potential appointment in B:
    s = starting time of the appointment
    t = ending time of the appointment
Now binary search on A_start and A_end to count the appointments that overlap [s, t], taking O(log n) time per appointment.
# overlaps = [ #(appointments with ending time <= t) - #(appointments with ending time < s) ]
           + [ #(appointments with ending time > t) - #(appointments with starting time > t) ]
Thus, the overall running time is O(n log n).
EDIT: # overlaps = sum_1 + sum_2
Here sum_1 counts the intervals with ending time <= t. To keep only the overlapping ones, we subtract the intervals with ending time < s, leaving those with ending time >= s and <= t.
Here sum_2 counts the intervals with ending time > t. To keep only the overlapping ones, we subtract the intervals with starting time > t, leaving those with ending time > t but starting time <= t.
Correctness follows from the fact that any overlapping interval has ending time either <= t or > t, so it is counted in exactly one of sum_1 and sum_2.
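For illustration, the two sums can be computed with binary searches via Python's standard-library bisect module (a sketch I added; it treats intervals as closed, so touching endpoints count as overlaps):

from bisect import bisect_left, bisect_right

def count_overlaps(A_start, A_end, s, t):
    # A_start: sorted starting times of the existing appointments
    # A_end:   sorted ending times of the existing appointments
    n = len(A_end)
    # sum_1: ending time in [s, t]
    sum_1 = bisect_right(A_end, t) - bisect_left(A_end, s)
    # sum_2: ending time > t, minus those starting after t
    sum_2 = (n - bisect_right(A_end, t)) - (n - bisect_right(A_start, t))
    return sum_1 + sum_2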

You can take the union of the existing and potential appointments into a single array, and sort the union by start time. Add a label to each interval so you know whether it is an existing or a potential interval. (You could also keep them in separate arrays and advance two indices, but the code is simpler with one list.)
You can then loop through the combined array and merge neighboring intervals when they overlap, merging existing appointments only with existing ones and potential only with potential. For this, you have to remember the most recent existing and potential intervals.
This way, you never need to go back to the very beginning; you only need to look at the most recently merged intervals.
In pseudocode:

E: existing appointments
P: potential appointments
A: union of P and E, sorted by start time

lastE = []
lastP = []
for each appointment a in A:
    if a is existing:
        if a overlaps with lastE:
            lastE = lastE + [a]
        else:
            lastE = [a]
        if a overlaps with lastP:
            print all appointments in lastP overlapping with a
    if a is potential:
        if a overlaps with lastE:
            print a
        if a overlaps with lastP:
            lastP = lastP + [a]
        else:
            lastP = [a]
Note that you need not store the structure of lastE: you can keep it as a single interval and just adjust its start and end times.
You do need the individual appointments in lastP, though. You can optimize further by keeping lastP in descending order of end time; then, on the line where all overlaps between a and lastP are printed, you can stop looking as soon as the end time of a potential appointment in lastP is smaller than the start time of a.
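A rough Python sketch of this sweep (my own rendering; it treats intervals as half-open and keeps lastE as a single merged interval, per the note above):

def report_conflicts(existing, potential):
    # Merge both sets into one list of (start, end, kind), sorted by start time.
    events = sorted([(s, e, 'E') for s, e in existing] +
                    [(s, e, 'P') for s, e in potential])
    lastE = None   # merged run of recent existing intervals, as one (start, end)
    lastP = []     # individual potential intervals of the current overlapping run
    for s, e, kind in events:
        if kind == 'E':
            # extend or restart the merged run of existing intervals
            if lastE and s < lastE[1]:
                lastE = (lastE[0], max(lastE[1], e))
            else:
                lastE = (s, e)
            # report potential appointments this existing one collides with
            for p in lastP:
                if s < p[1]:
                    print('conflict:', (s, e), 'with potential', p)
        else:
            # a potential appointment conflicts if it overlaps the existing run
            if lastE and s < lastE[1]:
                print('conflict: potential', (s, e), 'overlaps existing run', lastE)
            # group overlapping potential appointments, as in the pseudocode
            if lastP and any(s < p[1] for p in lastP):
                lastP.append((s, e))
            else:
                lastP = [(s, e)]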

If we first join the two arrays, the join takes O(n); sorting the combined array with quicksort or merge sort then takes O(n log n). Computing the total time complexity gives
F(n) = O(n) + O(n log n)
so the final complexity is O(n log n), which is better than O(n^2).

Related

Fastest method of computing every possible combination of four array choices?

I am working with four MATLAB arrays of size 169x14, 207x14, 94x14, and 108x14. I would like to produce a single array which has the linear addition of every possible row combination of the four arrays. For example, one such combination may be the 99th row of array1, the 72nd row of array2, 6th row of array3, and 27th row of array 4 added together as a single row. These arrays are named helm, chest, arm, leg - this is for a stat calculator of a video game.
My first attempt at this was the following:
for i = 1:length(lin_helm)
    for k = 1:length(lin_arm)
        for j = 1:length(lin_chest)
            for g = 1:length(lin_leg)
                armor_comb = [armor_comb;
                    i j k g helm_array(i,2:15)+chest_array(j,2:15)+arm_array(k,2:15)+leg_array(g,2:15)];
            end
        end
    end
end
This uses nested for loops for each array and simply adds the rows together (note that the 'lin_X' variables are just index vectors of row numbers, and columns 2:15 are used because the first column is a row counter). The first four columns of the result array can be ignored; they just denote which rows were selected from the other arrays. To say the least, this is extremely slow.
I then tried omitting the last for loop to instead take the first three selections and add them as an entire matrix to the entire last array. This was done by taking the addition of the first three row selections and using a matrix of ones. I chose to do this for the largest array, chest, to save the most time.
for i = 1:length(lin_helm)
    for k = 1:length(lin_arm)
        for j = 1:length(lin_leg)
            armor_comb = [armor_comb;
                i*ones(length(lin_chest),1) j*ones(length(lin_chest),1) k*ones(length(lin_chest),1) lin_chest' ones(length(lin_chest),14).*[helm_array(i,2:15)+leg_array(j,2:15)+arm_array(k,2:15)]+chest_array(:,2:15)];
        end
    end
end
This was significantly faster, but still extremely slow compared to the total array size needed.
I am not sure how to make this process faster using matrix math. To generalize my issue: I am trying to compute the array of all possible row additions of an AxN, a BxN, a CxN, and a DxN array, where any given selection takes exactly one row from each array, with no repeats.
All the online documentation I can find just says to use nested for loops, on the assumption that the array sizes are small. That is impractical for my application, so I am seeking help on how to use matrices (or another method) to speed up computation time.
For making indexes (the first columns of your final matrix), you can try something like this:
function i=indexes(i1, i2)
    i=[kron(i1, ones(size(i2, 1), 1)) kron(ones(size(i1, 1), 1), i2)];
end
If a and b are column vectors of indexes 1, 2, ..., then indexes(a, b) will be the pairs of index combos, and you can repeat for additional indexing columns, e.g., indexes(indexes(a, b), c).
If you have the indexes, say ii, you can add up what you want with something like
array1(ii(:, 1), 2:15) + array2(ii(:, 2), 2:15)
Prepend with ii if you really need to.
This will be much faster than a naive loop like you have initially. E.g., on my somewhat old Matlab, this:
n=10;
a=(1:2*n)';
b=(1:3*n)';
c=(1:5*n)';
tic
ii=indexes(indexes(a,b),c);
toc
tic
jj=[];
k=1;
for i1=1:length(a)
    for i2=1:length(b)
        for i3=1:length(c)
            jj(k, :)=[i1 i2 i3];
            k=k+1;
        end
    end
end
toc
gives
Elapsed time is 0.003514 seconds.
Elapsed time is 0.754066 seconds.
If you pre-allocate the storage for the loop case, like jj=zeros(size(ii));, that is also significantly faster, though still slower than the kron-based approach; e.g., with n=100:
Elapsed time is 3.323197 seconds.
Elapsed time is 9.825276 seconds.
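For what it's worth, the same construction carries over outside MATLAB. Here is a NumPy sketch I put together (my own port, not part of the answer; np.kron is NumPy's Kronecker product, playing the role of kron above, and NumPy indexing is 0-based):

import numpy as np

def indexes(i1, i2):
    # Pair every row of i1 with every row of i2, mirroring the MATLAB version.
    i1, i2 = np.asarray(i1), np.asarray(i2)
    if i1.ndim == 1:
        i1 = i1[:, None]   # treat 1-D input as a column vector
    if i2.ndim == 1:
        i2 = i2[:, None]
    return np.hstack([np.kron(i1, np.ones((i2.shape[0], 1), dtype=int)),
                      np.kron(np.ones((i1.shape[0], 1), dtype=int), i2)])

# e.g. for index vectors a, b, c:
# ii = indexes(indexes(a, b), c)
# rows = array1[ii[:, 0], 1:15] + array2[ii[:, 1], 1:15] + array3[ii[:, 2], 1:15]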

Number of events in one array within w minutes after any event in a second array

I have two sorted arrays of Unix timestamps (integers representing the times at which some events happen). Let's call the arrays ts1 and ts2. I want to find the number of events in ts1 that lie within w minutes after any event in ts2. Say the method signature is (take the first and second arrays and a window size, then return the number of events in ts1 that are within w minutes after any event in ts2):
critical_events(ts1,ts2,w)->int
Here are some test cases:
## Test cases.
ev = critical_events([.5,1.5,2.5],[1,2,3],.5)
print(ev==0)
ev = critical_events([1.4,1.4,2.7],[1,2,3],.5)
print(ev==2)
ev = critical_events([1.4,2.4,3.4],[1,2,3],.5)
print(ev==3)
I expect the length of the first array, n, to be much larger than the length of the second one, m. I am looking for algorithms that are efficient in time and space and, if possible, their average and worst case complexities in terms of n and m.
My attempt: instead of explaining my attempts, I'll just link to the code which should be self-explanatory (or at least better than what I can do in words): https://gist.github.com/ryu577/fdc22af4ed17d122a6aa25684597745b
You are showing the arrays as sorted, so my assumption is that they are (they need to be for this to work).
Because your first array is much larger than your second, you should loop over the second one.
I am using test case 2: ev = critical_events([1.4,1.4,2.7],[1,2,3],.5)
Next, you can binary search ts1 for the first element of ts2 plus the interval (1 + 0.5 = 1.5).
Your startIndex is 0 and endIndex is 2, so in the first pass you consider all elements.
The binary search yields index 2 in ts1. Note: because your array can contain equal elements, you need to keep going right until you reach a higher number. What you can tell now is that 2.7 (and all elements after it, if there were any) lies after 1.5. Count is ts1.length - foundIndex.
Now you can set your start index to 2, because you know everything to the left of that index is smaller and does not lie after 1.5.
You take the second element and binary search again; you will find index 2 (2.5 < 2.7), and again:
Count = Count + ts1.length - foundIndex.
To my knowledge this is the fastest method. I believe the running time is O(m log n).
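Since the expected outputs exclude the endpoints (see test case 1), here is a bisect-based Python sketch of my own: for each event x in ts1, check whether some y in ts2 satisfies x - w < y < x. This counts each ts1 event at most once, matches the test cases above, and runs in O(n log m):

from bisect import bisect_right

def critical_events(ts1, ts2, w):
    # Count events x in ts1 with some y in ts2 such that x - w < y < x,
    # i.e. x falls strictly inside (y, y + w). Endpoints are excluded,
    # which is what the expected outputs above imply.
    count = 0
    for x in ts1:
        i = bisect_right(ts2, x - w)     # first y with y > x - w
        if i < len(ts2) and ts2[i] < x:
            count += 1
    return count

print(critical_events([.5, 1.5, 2.5], [1, 2, 3], .5) == 0)
print(critical_events([1.4, 1.4, 2.7], [1, 2, 3], .5) == 2)
print(critical_events([1.4, 2.4, 3.4], [1, 2, 3], .5) == 3)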

Efficient way to search within unsorted array

I have an unsorted array A containing values within the range 0 to 100. I have multiple queries of the format QUERY(starting array index, ending array index, startValue, endValue). I want to return the array of indexes in the given range whose values lie between startValue and endValue. The naive approach takes O(n) time per query, and I need a more efficient algorithm. Also, the queries are not known in advance.
There are some tradeoffs in terms of memory usage, preprocessing time and query time. Let h be the range of possible values (101 in your case). Ideally you would like your queries to take O(m) time, where m is the number of indexes returned. Here are some approaches.
2-d trees. Each array element V[x] = y corresponds to a 2-d point (x, y). Each query (start, end, min, max) corresponds to a range query in the 2-d tree between those boundaries. This implementation needs O(n) memory, O(n log n) preprocessing time and O(sqrt n + m) time per query (see the complexity section). Notably, this does not depend on h.
Sorted arrays + min-heap (Arguably an easier implementation if you roll your own).
Build h sorted arrays P_0 ... P_h, where P_k is the array of positions where the value k occurs in the original array. This takes O(n) memory and O(n) preprocessing time.
Now we can answer in O(log n) (using binary search) queries of the form next(pos, k): "starting at position pos, where does the next value of k occur?"
To answer a query (start, end, min, max), begin by collecting next(start, min), next(start, min + 1), ..., next(start, max) and build a min-heap with them. This takes O(h log n) time. Then, while the minimum of the heap is at most end, remove it from the heap, add it to the list of indices to return, and add in its place the next element from its corresponding P array. This yields a complexity of O(h log n + m log h) per query.
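A minimal Python sketch of this sorted-arrays + min-heap scheme (my own illustration, using the standard-library heapq and bisect modules):

import heapq
from bisect import bisect_left

def preprocess(A):
    # P[k] lists, in increasing order, the positions where value k occurs.
    P = {}
    for i, v in enumerate(A):
        P.setdefault(v, []).append(i)
    return P

def query(P, start, end, vmin, vmax):
    # Seed the heap with next(start, k) for every value k in [vmin, vmax].
    heap = []
    for k in range(vmin, vmax + 1):
        pos = P.get(k, [])
        j = bisect_left(pos, start)       # first occurrence of k at or after start
        if j < len(pos):
            heap.append((pos[j], k, j))
    heapq.heapify(heap)
    result = []
    # Repeatedly take the smallest pending position while it is within range,
    # replacing it with the next occurrence of the same value.
    while heap and heap[0][0] <= end:
        i, k, j = heapq.heappop(heap)
        result.append(i)
        if j + 1 < len(P[k]):
            heapq.heappush(heap, (P[k][j + 1], k, j + 1))
    return result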
I have two more ideas based on the linearithmic approach to range minimum queries, but they require O(nh) and O(nh log h) space respectively. The query time is improved to O(m). If that is not prohibitive, please let me know and I will edit the answer to elaborate.

Checking if two substring overlaps in O(n) time

Suppose I have a string S of length n and a list of tuples (a,b), where a specifies the starting position of a substring of S and b is its length. To check whether any substrings overlap, we can, for example, mark each position in S whenever it is touched. However, I think this takes O(n^2) time if the list of tuples has size n (looping over the tuple list, then looping over S).
Is it possible to check whether any substring overlaps another in O(n) time?
Edit:
For example, S = "abcde" and Tuples = [(1,2),(3,3),(4,2)], representing "ab", "cde" and "de". I want to know that an overlap is discovered when (4,2) is read.
I was thinking it is O(n^2) because for every tuple you need to loop through its substring of S to see whether any character is marked dirty.
Edit 2:
I cannot exit once a collision is detected. Imagine I need to report all the subsequent tuples that collide, so I have to loop through the whole tuple list.
Edit 3:
A high level view of the algorithm:
for each tuple (a,b)
    for (int i=a; i < a+b; i++)
        if S[i] is dirty
            then report tuple and break //break inner loop only
Your basic approach is correct, but you could optimize your stopping condition, in a way that guarantees bounded complexity in the worst case. Think about it this way - how many positions in S would you have to traverse and mark in the worst case?
If there is no collision, then at worst you'll visit length(S) positions (and run out of tuples by then, since any additional tuple would have to collide). If there is a collision, you can stop at the first marked position, so again you are bounded by the maximum number of unmarked elements, which is length(S).
EDIT: since you added a requirement to report all colliding tuples, let's calculate this again (extending my comment) -
Once you have marked all elements, you can detect a collision for every further tuple in a single step (O(1)), and therefore you need O(n+n) = O(n) steps overall.
This time, each step either marks an unmarked element (n overall in the worst case) or identifies a colliding tuple (at worst O(#tuples), which we assume is also n).
The actual steps may be interleaved, since the tuples may be arranged in any order without colliding at first; but once they do (after at most n tuples, which cover all n elements before the first collision), every subsequent tuple must collide on its first step. Other arrangements may collide earlier, even before all elements are marked, but that merely rearranges the same number of steps.
Worst case example: one tuple covering the entire array, then n-1 tuples (doesn't matter which) -
[(1,n), (n,1), (n-1,1), ...(1,1)]
The first tuple takes n steps to mark all elements; the rest take O(1) each, for an overall O(2n) = O(n). Now convince yourself that the following example takes the same number of steps:
[(1,n/2-1), (1,1), (2,1), (3,1), (n/2,n/2), (4,1), (5,1) ...(n,1)]
According to your description and comments, this may not really be a string-algorithm problem; it can be regarded as a "segment overlap" problem.
Using your example, the tuples translate to 3 segments: [1, 2], [3, 5], [4, 5]. The question is then whether any of the 3 segments overlap.
Suppose we have m segments, each in the format [start, end], meaning the segment's start and end positions. One efficient algorithm to detect overlap is to sort them by start position in ascending order, which takes O(m lg m). Then iterate over the sorted segments and, for each segment i, compare its start position with the largest end position seen so far:

if (start[i] <= maxEnd[i-1]) {
    // segment i overlaps an earlier segment
}
maxEnd[i] = max(maxEnd[i-1], end[i]); // update max end position over segments 1..i

Each check takes O(1), so the total time complexity is O(m lg m + m), which can be regarded as O(m lg m). (Reporting the actual overlapping positions is related to each tuple's length, and hence to n.)
This is a segment overlap problem, and a solution is possible in O(n) if the list of tuples is already sorted in ascending order by the first field. Consider the following approach:
1. Transform the intervals from (start, number of characters) to (start, inclusive_end). The example above becomes: [(1,2),(3,3),(4,2)] ==> [(1, 2), (3, 5), (4, 5)].
2. The tuples are valid if consecutive transformed tuples (a, b) and (c, d) always satisfy b < c. Otherwise there is an overlap, as in the tuples above.
Each of steps 1 and 2 can be done in O(n) when the list is sorted as described.
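In Python, the whole check fits in a few lines (a sketch I added, assuming 1-indexed (start, length) tuples sorted by start, as in the example):

def report_overlapping(tuples):
    # Transform (start, length) -> (start, inclusive_end) on the fly and
    # report every tuple whose start falls at or before the largest end so far.
    overlapping = []
    max_end = 0
    for a, b in tuples:
        if a <= max_end:
            overlapping.append((a, b))
        max_end = max(max_end, a + b - 1)
    return overlapping

print(report_overlapping([(1, 2), (3, 3), (4, 2)]))  # -> [(4, 2)]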

Time Complexity of Insertion and Selection sort When there are only two key values in an array

I am reviewing Algorithms, 4th Edition by Sedgewick and recently came across the following problem, which I cannot solve.
The problem goes like this:
2.1.28 Equal keys. Formulate and validate hypotheses about the running time of insertion
sort and selection sort for arrays that contain just two key values, assuming that
the values are equally likely to occur.
Explanation: You have n elements, each can be 0 or 1 (without loss of generality), and for each element x: P(x=0)=P(x=1).
Any help is welcome.
Selection sort:
The time complexity is going to remain the same as it is without the two-key assumption: it is independent of the values in the array and depends only on the number of elements.
Time complexity for selection sort in this case is O(n^2).
However, this is true only for the original algorithm, which scans the entire tail of the array in each outer loop iteration. If you optimize it to look for the next "0", then at iteration i, since you have already "cleared" the first i-1 zeros, the expected location of the i-th zero is index 2i. This means the inner loop needs about 2i - (i-1) = i+1 iterations each time.
Summing it up gives:
1 + 2 + ... + n = n(n+1)/2
which is, unfortunately, still O(n^2).
Another optimization could be to remember where you last stopped. This improves the complexity significantly, to O(n), since you never traverse the same element more than once; but that is really a different algorithm, not selection sort.
Insertion Sort:
Here, things are more complicated. Note that in the inner loop (taken from wikipedia), the number of operations depends on the values:
while j > 0 and A[j-1] > x
However, recall that in insertion sort, after the i-th step the first i elements are sorted. Since we assume P(x=0)=P(x=1), on average i/2 of them are 0's and i/2 are 1's.
This means the inner loop takes O(i/2) time on average.
Summing this up gives:
1/2 + 2/2 + 3/2 + ... + n/2 = 1/2 * (1+2+...+n) = 1/2 * n(n+1)/2 = n(n+1)/4
This is, however, still O(n^2).
The above is not a formal proof, because it implicitly uses E(f(E(x))) = E(f(x)), which is not true in general, but it can give you guidelines for how to build a formal proof.
Well, obviously you only need to search until you find the first 0 when looking for the next smallest element. For example, in selection sort you scan the array looking for the next smallest number to swap into the current position. Since there are only 0s and 1s, you can stop the scan at the first 0 (it is the next smallest number), so there is no need to scan the rest of the array in that pass. If no 0 is found, the sorting is complete, since the "unsorted" portion is all 1s.
Insertion sort is basically the same. They are both O(N) in this case.
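To make the selection-sort shortcut concrete, here is a small Python sketch (my own illustration of the idea above, not code from the answer):

def selection_sort_two_keys(a):
    # Selection sort specialised to 0/1 keys: the scan for the minimum of the
    # tail can stop at the first 0, and if no 0 remains the tail is all 1s.
    n = len(a)
    for i in range(n):
        j = i
        while j < n and a[j] != 0:
            j += 1                  # scan only until the first 0
        if j == n:                  # no 0 left: the rest is all 1s, done
            break
        a[i], a[j] = a[j], a[i]     # swap the first 0 into position i
    return a

print(selection_sort_two_keys([1, 0, 1, 1, 0, 0, 1]))  # -> [0, 0, 0, 1, 1, 1, 1]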
