Efficient way to search within unsorted array - arrays

I have an unsorted array A containing values in the range 0 to 100. I have multiple queries of the format QUERY(starting array index, ending array index, startValue, endValue). I want to return an array of the indexes whose values lie between startValue and endValue. The naive approach takes O(n) time per query, and I need a more efficient algorithm. Also, the queries are not known in advance.

There are some tradeoffs in terms of memory usage, preprocessing time and query time. Let h be the range of possible values (101 in your case). Ideally you would like your queries to take O(m) time, where m is the number of indexes returned. Here are some approaches.
2-d trees. Each array element V[x] = y corresponds to a 2-d point (x, y). Each query (start, end, min, max) corresponds to a range query in the 2-d tree between those boundaries. This implementation needs O(n) memory, O(n log n) preprocessing time and O(sqrt n + m) time per query (see the complexity section). Notably, this does not depend on h.
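For illustration, here is a minimal k-d tree sketch in Python (not from the answer; names such as build and range_search are made up for this example). It uses a simple median-split build, which is O(n log^2 n) rather than the O(n log n) quoted above, but the range query works as described: each array element becomes the point (index, value) and a query is an axis-aligned box.

from typing import List, Optional, Tuple

Point = Tuple[int, int]          # (array index, value)

class KdNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point
        self.axis = axis         # 0 = split on index, 1 = split on value
        self.left = left
        self.right = right

def build(points: List[Point], depth: int = 0) -> Optional[KdNode]:
    if not points:
        return None
    axis = depth % 2
    points.sort(key=lambda p: p[axis])
    mid = len(points) // 2       # median split keeps the tree balanced
    return KdNode(points[mid], axis,
                  build(points[:mid], depth + 1),
                  build(points[mid + 1:], depth + 1))

def range_search(node, lo, hi, out):
    # Collect every point p with lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1].
    if node is None:
        return
    p, a = node.point, node.axis
    if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
        out.append(p)
    if lo[a] <= p[a]:            # the left subtree may still contain matches
        range_search(node.left, lo, hi, out)
    if p[a] <= hi[a]:            # the right subtree may still contain matches
        range_search(node.right, lo, hi, out)

A = [5, 17, 100, 3, 17, 42]
tree = build([(i, v) for i, v in enumerate(A)])
hits = []
range_search(tree, lo=(1, 10), hi=(4, 50), out=hits)   # indices 1..4, values 10..50
print(sorted(i for i, _ in hits))                       # [1, 4]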
Sorted arrays + min-heap (Arguably an easier implementation if you roll your own).
Build h sorted arrays P0, ..., Ph, where Pk is the array of positions at which the value k occurs in the original array. This takes O(n) memory and O(n) preprocessing time.
Now we can answer queries of the form next(pos, k) in O(log n) using binary search: "starting at position pos, where does the next value of k occur?"
To answer a query (start, end, min, max), begin by collecting next(start, min), next(start, min + 1), ..., next(start, max) and build a min-heap with them. This takes O(h log n) time. Then, while the minimum of the heap is at most end, remove it from the heap, add it to the list of indices to return, and add in its place the next element from its corresponding P array. This yields a complexity of O(h log n + m log h) per query.
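A possible Python sketch of this second approach (function names like preprocess and query are illustrative, not from the answer): each value k maps to the sorted list of its positions, next(pos, k) is a binary search, and a min-heap merges the candidate streams.

import heapq
from bisect import bisect_left
from collections import defaultdict

def preprocess(A):
    # For every value k, the sorted list Pk of positions where k occurs.
    pos = defaultdict(list)
    for i, v in enumerate(A):
        pos[v].append(i)                   # indices arrive in increasing order
    return pos

def query(pos, start, end, lo, hi):
    # Return all indices i in [start, end] with lo <= A[i] <= hi.
    heap = []
    for k in range(lo, hi + 1):            # next(start, k) for every value in range
        P = pos.get(k)
        if not P:
            continue
        j = bisect_left(P, start)          # first occurrence of k at or after start
        if j < len(P):
            heap.append((P[j], k, j))
    heapq.heapify(heap)
    out = []
    while heap and heap[0][0] <= end:      # minimum of the heap is still in range
        i, k, j = heapq.heappop(heap)
        out.append(i)
        if j + 1 < len(pos[k]):            # advance within Pk
            heapq.heappush(heap, (pos[k][j + 1], k, j + 1))
    return out

A = [7, 3, 50, 3, 99, 12]
print(query(preprocess(A), start=1, end=4, lo=3, hi=50))   # [1, 2, 3]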
I have two more ideas based on the linearithmic approach to range minimum queries, but they require O(nh) and O(nh log h) space respectively. The query time is improved to O(m). If that is not prohibitive, please let me know and I will edit the answer to elaborate.

Related

Why is the complexity of merging M sorted arrays linear time?

Suppose we want to perform an external sort and have M sorted blocks, where each block contains k comparable items, such that n = Mk. k in this case also refers to the maximum number of items you can fit into memory for sorting, and n is the total number of items to sort.
Then, using the merge function in merge sort, each element will have to be compared against elements from all the other blocks, which gives me O(M) comparisons for one element. Since we have to do this for all elements, we will have O(M * Mk) = O(M^2 * k) = O(nM) time complexity.
This seems to be linear at first, but suppose in the worst case we can only fit 1 item into memory. Then we have M = n blocks, and the time complexity is O(n^2) directly. How does the merging give you linear time in external sort?
Also, in the case where k = 1, how is the sorting even feasible when no comparisons can be done?
Make it priority-queue based: for example, use a binary heap, fill it with the current items (or their indexes) from every block, and extract the top item at every step.
Extracting takes O(log(M)) per output element, so full merging is O(n*log(M))
For your artificial example: O(n*log(n))
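A minimal Python sketch of that heap-based merge (merge_blocks is an illustrative name): the heap holds one entry per block, so each extraction and refill costs O(log M).

import heapq

def merge_blocks(blocks):
    # k-way merge of M sorted blocks: O(n log M) overall.
    heap = [(block[0], b, 0) for b, block in enumerate(blocks) if block]
    heapq.heapify(heap)                       # one (value, block, index) entry per block
    out = []
    while heap:
        val, b, i = heapq.heappop(heap)       # O(log M) per output element
        out.append(val)
        if i + 1 < len(blocks[b]):
            heapq.heappush(heap, (blocks[b][i + 1], b, i + 1))
    return out

print(merge_blocks([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]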

Checking if two substrings overlap in O(n) time

Suppose I have a string S of length n and a list of tuples (a,b), where a specifies the starting position of a substring of S and b is its length. To check whether any substrings overlap, we can, for example, mark each position in S whenever it is touched. However, I think this will take O(n^2) time if the list of tuples has size n (looping over the tuple list, then looping over S).
Is it possible to check if any substring actually overlaps with the other in O(n) time?
Edit:
For example, S = "abcde". Tuples = [(1,2),(3,3),(4,2)], representing "ab", "cde" and "de". I want to know that an overlap is discovered when (4,2) is read.
I was thinking it is O(n^2) because for every tuple you need to loop through the corresponding substring of S to see if any character is marked dirty.
Edit 2:
I cannot exit once a collision is detected. Imagine I need to report all the subsequent tuples that collide, so I have to loop through the whole tuple list.
Edit 3:
A high level view of the algorithm:
for each tuple (a, b)
    for (int i = a; i < a + b; i++)       // positions a .. a+b-1 of the substring
        if S[i] is dirty
            then report tuple and break   // break inner loop only
        else mark S[i] as dirty
Your basic approach is correct, but you could optimize your stopping condition, in a way that guarantees bounded complexity in the worst case. Think about it this way - how many positions in S would you have to traverse and mark in the worst case?
If there is no collision, then at worst you'll visit length(S) positions (and run out of tuples by then, since any additional tuple would have to collide). If there is a collision, you can stop at the first marked position, so again you're bounded by the maximum number of unmarked elements, which is length(S).
EDIT: since you added a requirement to report all colliding tuples, let's calculate this again (extending my comment) -
This time, each step would either mark an unmarked element (overall n in the worst case), or identify a colliding tuple (worst case O(tuples), which we assume is also n).
Once you have marked all elements, you can detect a collision for every further tuple with a single step (O(1)), and therefore you would need O(n + n) = O(n).
The actual steps may be interleaved, since the tuples may be organized in any order without colliding at first, but once they do (after at most n tuples which cover all n elements before colliding for the first time), you have to collide every time on the first step. Other arrangements may collide earlier, even before marking all elements, but again - you're just rearranging the same number of steps.
Worst case example: one tuple covering the entire array, then n-1 tuples (doesn't matter which) -
[(1,n), (n,1), (n-1,1), ...(1,1)]
The first tuple would take n steps to mark all elements, and the rest would take O(1) each to finish. Overall O(2n) = O(n). Now convince yourself that the following example takes the same number of steps -
[(1,n/2-1), (1,1), (2,1), (3,1), (n/2,n/2), (4,1), (5,1) ...(n,1)]
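To make the counting concrete, here is a Python sketch of the marking approach with the optimized stopping condition (report_collisions is an illustrative name). Every inner-loop iteration either marks a fresh position (at most n of those overall) or reports the current tuple and breaks, so the total work is O(n + number of tuples).

def report_collisions(n, tuples):
    dirty = [False] * (n + 1)          # 1-based positions, as in the example
    colliding = []
    for (a, b) in tuples:
        for i in range(a, a + b):      # positions a .. a+b-1
            if dirty[i]:
                colliding.append((a, b))
                break                  # stop at the first marked position
            dirty[i] = True
    return colliding

# S = "abcde"; tuples for "ab", "cde", "de" - (4,2) collides with (3,3)
print(report_collisions(5, [(1, 2), (3, 3), (4, 2)]))   # [(4, 2)]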
According to your description and comments, the overlap problem may not really be a string problem; it can be regarded as a "segment overlap" problem.
Using your example, it can be translated into 3 segments: [1, 2], [3, 5], [4, 5]. The question is then to check whether any of the 3 segments overlap.
Suppose we have m segments, each in the format [start, end], meaning the segment's start and end positions. One efficient algorithm to detect overlaps is to sort them by start position in ascending order, which takes O(m * lgm). Then iterate over the sorted m segments; for each segment i, you only need to check whether its start position is at most the maximum end position of the previous segments:
if (start[i] <= maxEnd[i-1]) {    // maxEnd[i-1] = max(end[j]) for 1 <= j <= i-1
    // segment i overlaps an earlier segment;
}
maxEnd[i] = max(maxEnd[i-1], end[i]); // update max end position of segments 1 to i
Each check operation takes O(1), so the total time complexity is O(m*lgm + m), which can be regarded as O(m*lgm). As for producing the actual output, the time is related to each tuple's length, which is also related to n.
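In Python, the sort-and-sweep described above could look like this (has_overlap is an illustrative name); it keeps only the running maximum end instead of the full maxEnd array.

def has_overlap(segments):
    # segments: (start, end) pairs, inclusive; O(m log m) for the sort, O(m) for the sweep.
    segments = sorted(segments)                # ascending by start position
    max_end = float("-inf")
    for start, end in segments:
        if start <= max_end:                   # starts before some earlier segment ends
            return True
        max_end = max(max_end, end)
    return False

# Tuples (1,2), (3,3), (4,2) become segments [1,2], [3,5], [4,5]
print(has_overlap([(1, 2), (3, 5), (4, 5)]))   # True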
This is a segment overlap problem and the solution should be possible in O(n) itself if the list of tuples has been sorted in ascending order wrt the first field. Consider the following approach:
Transform the intervals from (start, number of characters) to (start, inclusive_end). Hence the above example becomes: [(1,2),(3,3),(4,2)] ==> [(1, 2), (3, 5), (4, 5)]
The tuples are valid if every pair of consecutive transformed tuples (a, b) and (c, d) satisfies b < c. Otherwise there is an overlap among the tuples mentioned above.
Both of these steps can be done in O(n) if the array is sorted in the form mentioned above.
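A short Python sketch of this check, assuming the tuples are already sorted by start position (tuples_valid is an illustrative name):

def tuples_valid(tuples):
    # (start, length) -> (start, inclusive_end); valid iff consecutive pairs satisfy b < c.
    intervals = [(a, a + b - 1) for (a, b) in tuples]
    return all(b < c for (a, b), (c, d) in zip(intervals, intervals[1:]))

# [(1,2), (3,3), (4,2)] -> [(1,2), (3,5), (4,5)]; 5 < 4 fails, so there is an overlap
print(tuples_valid([(1, 2), (3, 3), (4, 2)]))   # False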

How to replace the elements of a range less than k with k?

How do I replace the elements in a range of an array that are greater than k by k, when the number of queries is high?
Each query is given in the form l r k, where [l...r] is the range of the array to update.
Since my first answer created a big thread of comments, I'm going to combine everything in a new answer.
We are going to use a Segment Tree as a helper data structure, which will be used to answer this question: what is the minimum on range [l, r]? Initially all segment tree nodes will be filled with some "infinity" value, which can be 201 in your problem (since all K are lower than 200, based on your comment).
Once we read our input array (let's call it A), we are going to process the queries:
for each query [L, R, K] we update our segment tree: try to set a new minimum K on the range [L, R]. That can be done in O(log N) using lazy propagation. Here is a great example: http://se7so.blogspot.com/2012/12/segment-trees-and-lazy-propagation.html
now we need to build the final array. We iterate over each index of the array and replace its value with A[i] = min(A[i], minimum_on_range(i, i)). That takes N * log(N) steps
The total complexity of this approach is O(M * log(N) + N * log(N))
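Here is a condensed Python sketch of the idea (class and method names are illustrative). Because a "set minimum K on range" update is commutative and we only need point queries at the very end, the sketch skips the push-down step of full lazy propagation and simply accumulates pending minimums along each root-to-leaf path.

INF = 201                                  # "infinity" for this problem: all K are below 200

class MinSegmentTree:
    def __init__(self, n):
        self.n = n
        self.lazy = [INF] * (4 * n)        # pending minimum stored at each node

    def update(self, l, r, k):             # record "minimum K" on [l, r], O(log n)
        self._update(1, 0, self.n - 1, l, r, k)

    def _update(self, node, lo, hi, l, r, k):
        if r < lo or hi < l:
            return
        if l <= lo and hi <= r:            # node fully covered: leave the mark here
            self.lazy[node] = min(self.lazy[node], k)
            return
        mid = (lo + hi) // 2
        self._update(2 * node, lo, mid, l, r, k)
        self._update(2 * node + 1, mid + 1, hi, l, r, k)

    def query(self, i):                    # minimum pending value covering index i, O(log n)
        node, lo, hi, best = 1, 0, self.n - 1, INF
        while lo != hi:
            best = min(best, self.lazy[node])
            mid = (lo + hi) // 2
            if i <= mid:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid + 1
        return min(best, self.lazy[node])

A = [5, 190, 7, 42, 100]
tree = MinSegmentTree(len(A))
for l, r, k in [(0, 2, 10), (1, 4, 50)]:   # queries [L, R, K], 0-based inclusive
    tree.update(l, r, k)
print([min(A[i], tree.query(i)) for i in range(len(A))])   # [5, 10, 7, 42, 50]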

The number of distinct integers in an array is O(log n). How to get an O(n log log n) worst-case time algorithm to sort such sequences?

The question is from The Algorithm Design Manual. I have been working on it but haven't found a method to arrive at the right answer.
Question:
We seek to sort a sequence S of n integers with many duplications, such that the number of distinct integers in S is O(log n). Give an O(n log log n) worst-case time algorithm to sort such sequences.
I think maybe I can first pick out all the distinct elements to form an array of length log n, then record the frequencies and sort it. However, my first step seems to blow up the running time too much... Is there a better selection method, or is my approach totally wrong? Thanks.
Use a balanced binary tree to calculate the number of occurrences of each number. Since there are only log N distinct numbers, the size of the tree is log N, and thus every operation is performed in O(log log N). (This is exactly how map<> is implemented in C++.)
Then, just iterate over the nodes of the tree in an in-order traversal, and print each integer the required number of times in this order.
Create an array containing pairs of (unique numbers, count). The array is initially empty and kept sorted.
For each number in your original array, look the number up in the sorted array using binary search. Since the array has size O(log N), each binary search takes O(log log N); you do that N times, for a total of O(N log log N). When found, you increase the count.
When not found, you insert the new number with a count of 1. This operation only happens O (log N) times, and is trivially done in O (log N) steps, for a total of O (log^2 N), which is much smaller than O (N log log N).
When you are done, fill the original array with the required numbers. That takes O (N).
There's really no need to create a balanced sorted tree to make the insertions faster, because the set of unique numbers is so small.
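A Python sketch of this sorted-pairs approach (sort_few_distinct is an illustrative name), using the standard bisect module for the binary search over the small array of (value, count) pairs:

from bisect import bisect_left

def sort_few_distinct(a):
    pairs = []                               # sorted list of [value, count] pairs
    for x in a:
        i = bisect_left(pairs, [x])          # O(log log n): the list has O(log n) entries
        if i < len(pairs) and pairs[i][0] == x:
            pairs[i][1] += 1                 # value already present: bump its count
        else:
            pairs.insert(i, [x, 1])          # happens only O(log n) times in total
    out = []
    for value, count in pairs:               # rebuild the array in sorted order
        out.extend([value] * count)
    return out

print(sort_few_distinct([5, 1, 5, 9, 1]))    # [1, 1, 5, 5, 9]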
If the set of integers is all contained in a range X ≤ number ≤ Y, then the problem can be solved in O(max(N, Y - X + 1)) using an array of Y - X + 1 counters, without even bothering to find the unique numbers. The technique is reportedly used to great effect in Iain Banks' book "Player of Games".
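For that small-range variant, the counting idea is simply (counting_sort_range is an illustrative name):

def counting_sort_range(a, X, Y):
    # All values are assumed to lie in X..Y; O(max(N, Y - X + 1)) time.
    counts = [0] * (Y - X + 1)
    for v in a:
        counts[v - X] += 1
    out = []
    for offset, c in enumerate(counts):
        out.extend([X + offset] * c)
    return out

print(counting_sort_range([7, 3, 9, 3, 7, 7], 3, 9))   # [3, 3, 7, 7, 7, 9]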

How to locate in a huge list of numbers, two numbers where xi=xj?

I have the following question, and it screams at me for a solution with hashing:
Problem :
Given a huge list of numbers x1, ..., xn, where xi <= T, we'd like to know
whether or not there exist two indices i, j where x_i == x_j.
Find an algorithm for the problem with O(n) run time (in expectation).
My solution at the moment: we use hashing, with a hash function h(x) and chaining.
First - we build a new array, let's call it A, where each cell is a linked list - this will be the destination array.
Now - we run over all n numbers and map each element of x1, ..., xn to its place in A using the hash function. This takes O(n) run time.
After that we run over A and look for collisions. If we find a cell where length(A[k]) > 1,
then we return the xi and xj that were mapped to A[k] - the total run time here is O(n) in the worst case, which occurs if the two equal numbers (if they indeed exist) were mapped to the last cell of A.
The same approach can be made ~twice as fast (on average), still O(n) on average - but with better constants.
No need to map all the elements into the hash and then go over it - a faster solution could be:
def find_duplicate(xs):
    table = set()                  # the hash table
    for e in xs:
        if e in table:             # e has been seen before: x_i == x_j
            return e
        table.add(e)
    return None                    # no duplicate exists
Also note that if T < n, there must be a duplicate within the first T+1 elements, by the pigeonhole principle.
Also, for small T you can use a simple array of size T instead of a hash table (hash(x) = x). Initializing it to contain zeros as initial values can be done in O(1).
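A tiny Python sketch of that direct-addressing variant (find_duplicate_small_T is an illustrative name), assuming the values satisfy 0 <= x <= T:

def find_duplicate_small_T(xs, T):
    seen = [False] * (T + 1)       # plays the role of the hash table, hash(x) = x
    for x in xs:
        if seen[x]:
            return x               # x_i == x_j found
        seen[x] = True
    return None

print(find_duplicate_small_T([4, 2, 7, 5, 2, 9], T=10))   # 2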
