Sort objects by timestamp, but then group dependencies - arrays

Let's say I have a list (array) of objects. Each of these objects has two properties: a timestamp and an optional parent object, which can be null. I'd like to first sort this array by timestamps, which is easy enough; but then, I'd want the dependent objects to be kept consecutive.
For example, consider this simplified example: three objects, A, B, and C. B's parent is A, but the timestamps are A=1, B=3, C=2. Sorting by timestamp gives [A, C, B], but then because B's parent is A, I want B to come after A; so the ideal result should be [A, B, C] after all.
Note that if two or more objects have the same parent, they should all be adjacent, but they should be relatively sorted by timestamp still.
What's the best way to do this? The only way I can think of is to sort by timestamp, then iterate through the array and, for each dependent object, move it after its parent; but that seems inefficient, since it calls for an extra round of iteration. Is there some way to incorporate the grouping into the initial sort so it completes in one round of sorting? (I'm currently using QuickSort, but if need be, I can switch to another algorithm.)

Brute-force non-working approach: to perform the sort in one single operation, you would need to make the parent part of the sort key, defining an "order" function such that
"A is parent of B" => order(A) < order(B), and
"C.timestamp < D.timestamp" => order(C) < order(D),
and then sort by {order(Node.Parent), timestamp(Node)} pairs using any algorithm you like.
Unfortunately, this "order" function requires sorting all child nodes first to satisfy the second condition, thus breaking the "one sort" requirement.
To get a single sort you can use a composite key that includes the timestamps of all parent nodes, and then sort by that composite key.
The easiest way to build the composite key is to construct a tree based on the parent references and set each node's key to the concatenation of its parent's key and its own timestamp, using any tree traversal.
Sample:
Data:
A (ts = 5) is the parent of B (ts = 7) and C (ts = 2)
B is the parent of D (ts = 3)
Building the tree (A has children B and C; B has child D):
A -> B -> D
  -> C
Pre-order traversal: A, B, D, C
Composite keys:
A -> A.timestamp = 5
B -> key(A) concat B.timestamp = 5.7
C -> key(A) concat C.timestamp = 5.2
D -> key(B) concat D.timestamp = 5.7.3
Data for sorting by {order, timestamp} pairs:
A {order(no-parent), ts} = {0, 5}
B {order(A), ts} = {1, 7}
C {order(A), ts} = {1, 2}
D {order(B), ts} = {2, 3}
Sorted key sequences: {5}, {5.2}, {5.7}, {5.7.3}; mapping back to nodes: A, C, B, D.
The complexity of this approach is O(n log(n) * max_depth):
building the tree, walking it, and building the keys takes O(n);
the sort has the usual O(num_elem * log(num_elem)) cost, multiplied by the cost of comparing keys, which depends on the depth of the parent-child tree. This part dominates the O(n) needed for the preparation phase.
Alternatively, you can just build the tree, sort each level by timestamp, and then put the nodes back into a list via a pre-order traversal; this removes the cost of comparing long keys, but breaks the requirement of a single sort.
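
Here is a minimal Python sketch of the composite-key approach, assuming each object exposes timestamp and parent attributes (illustrative names, not from the question). Python tuples compare lexicographically, which gives exactly the prefix ordering described above:

def sort_with_dependencies(nodes):
    # The key of a node is the tuple of timestamps along its root-to-node
    # path, so every node sorts directly after its parent, and siblings
    # sort among themselves by timestamp.
    def key(node):
        path = []
        while node is not None:
            path.append(node.timestamp)
            node = node.parent
        return tuple(reversed(path))
    return sorted(nodes, key=key)

On the sample data this yields the keys (5,), (5,7), (5,2), (5,7,3) and the order A, C, B, D. Each key computation walks up to the root, matching the O(n log(n) * max_depth) bound; equal sibling timestamps would need an extra tie-breaker.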

You could sort the objects into lexicographical order using a sequence of one or two numbers as the sort key: if an object has no parent, its sequence has a single element, its own number; if an object has a parent, the first element of the sequence is its parent's number and the second is its own number.
So A, B, and C get the sequences {1}, {1, 3}, and {2}, and B sorts just after its parent.
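
This is the one-level special case of the composite key above; a quick sketch under the same assumed attribute names:

def key(obj):
    # (own,) for roots, (parent, own) for children; one level of nesting only.
    if obj.parent is None:
        return (obj.timestamp,)
    return (obj.parent.timestamp, obj.timestamp)

For the A=1, B=3, C=2 example, sorted(objects, key=key) compares (1,), (1,3), (2,) and returns [A, B, C].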

Related

Three Way Quicksort for arrays with many duplicates: Partition Placement

I'm working on a variation of the Quick Sort algorithm designed to handle arrays where there may be many duplicate elements. The basic idea is to divide the array into 3 partitions first: all elements Below the Pivot Value (with the initial pivot value being chosen at random); all elements Equal to the Pivot Value; and all elements Greater than the Pivot Value.
I need some advice regarding the best way to arrange the partitions.
What is the best way to arrange the Partition in a Three Way Quick Sort?
The first way I might go about it is to just keep the Pivot Partition on the left, which would make it easy to define the boundaries when I return them to the larger Quick Sort function I plan to nest the Partition function within. But that makes subsequent recursive calls to sort the Above and Below Partitions a little tricky, since they would all be lumped together in one large partition above the Pivot Partition to start with (instead of being more neatly organized into an Above and a Below Partition). I could use a loop to move each of these elements above and below the Pivot Partition, but I suspect that would undermine the efficiency of the algorithm. After doing this, I could make two recursive calls to Quick Sort: once on the Below Partition, and again on the Above Partition.
OR I could modify Partition to insert "Below" elements to the left of the Pivot Partition, and insert "Above" elements to the right. This reduces the need for linear scans over the array, but it means I would have to update the left and right bounds of the partition as the Partition function operates over the array.
I believe the second choice is the better one, but I want to see if anyone has any other ideas.
For reference, the initial array might look something like this:
array = [2, 2, 1, 9, 2]
Assuming the Pivot is randomly chosen as the value 2, then after Partition it could look either like this:
array = [2, 2, 2, 9, 1]
Or like this if I insert above and below the partition during the Partition Function:
array = [1, 2, 2, 2, 9]
And the "shell code" I'm supposed to build this function around looks like this:
def randomized_quick_sort(a, l, r):
    if l >= r:
        return
    k = random.randint(l, r)
    a[l], a[k] = a[k], a[l]
    left_part_bound, right_part_bound = partition3(a, l, r)
    randomized_quick_sort(a, l, left_part_bound - 1)
    randomized_quick_sort(a, right_part_bound + 1, r)
*The end result doesn't need to look like this (I just need to output the right result and finish within a time limit to demonstrate minimal efficiency), but it shows why I think I may need to create the Above and Below Partitions as I'm creating the Pivot Partition.
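
For what it's worth, the second arrangement is the classic Dutch national flag partition, which moves smaller elements left and larger elements right in a single pass. Below is a minimal sketch compatible with the shell code above; it assumes the pivot has already been swapped to a[l] and that partition3 should return the inclusive bounds of the run of pivot-valued elements:

import random

def partition3(a, l, r):
    # Invariant: a[l..lt-1] < pivot, a[lt..i-1] == pivot,
    # a[i..gt] unexamined, a[gt+1..r] > pivot.
    pivot = a[l]
    lt, i, gt = l, l, r
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    return lt, gt  # inclusive bounds of the Equal-to-Pivot block

Each element is examined exactly once, so the partition costs O(r - l) with no follow-up scans, and the returned bounds slot directly into the two recursive calls. On [2, 2, 1, 9, 2] with pivot 2 it produces [1, 2, 2, 2, 9] and returns (1, 3).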

Queries on Array

You are given an array of N positive integers, A1, A2, ..., An. You have to answer Q queries. Each query consists of two integers L and K. For each query, you have to report the Kth element that is larger than or equal to L, when all such elements are listed in increasing order of their indices.
Example: A = 22, 44, 12, 16, 14, 88, 25, 49
Query 1: L = 3, K = 4
Since all elements are greater than or equal to 3, we list the whole array, i.e. 22, 44, 12, 16, 14, 88, 25, 49. The 4th element among these is 16.
Query 2: L = 19, K = 5
Listed elements: 22, 44, 88, 25, 49. The 5th element among these is 49.
What I have done: for each query, iterate over the whole array and pick the Kth element that is larger than or equal to L. Complexity: O(Q*N).
What I require: O(Q*log N) complexity.
Constraints: 1 <= Q <= 10^5, 1 <= N <= 10^5, 1 <= Ai <= 10^5
One possible way to solve this task is to use an immutable (persistent) binary RB tree.
First you need to sort your array in ascending order, storing the original index of each element next to it.
Then traverse the sorted array in reverse (descending) order, adding elements one by one to an immutable binary tree. The key in the tree is the original index of the element. The tree is immutable, so by adding an element I mean constructing a new tree with the element added. Save the tree created at each step next to the corresponding element (the element that was last added to the tree).
Having these trees constructed for each element you can do your queries in O(log N) time.
Query:
First, perform a binary search in the sorted array (O(log N)) for the first element that is greater than or equal to L. You'll find that element and the corresponding tree holding the indices of all elements greater than or equal to L. In this tree (augmented with subtree sizes) you can find the K-th smallest index in O(log N) time.
The whole algorithm takes O(N log N + Q log N) time. I don't believe it is possible to do better, as sorting the original array seems unavoidable.
The key to this approach is the immutable binary tree. This structure shares the properties of a mutable binary tree, such as insertion and search in O(log N), while staying immutable: when you add an element, the previous version of the tree is preserved, and only the nodes that differ from the previous version are recreated, usually O(log N) of them. Thus creating N trees from the elements of your array requires O(N log N) time and O(N log N) space.
You can use the immutable RB tree implementation in Scala as a reference.
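
To make the versioned-tree idea concrete, here is a sketch in Python that substitutes a persistent segment tree over array positions for the immutable RB tree (a common stand-in; all names are illustrative). Each insert rebuilds only one root-to-leaf path, so old versions survive and each new version costs O(log N) nodes:

import bisect

class Node:
    __slots__ = ("left", "right", "count")
    def __init__(self, left=None, right=None, count=0):
        self.left, self.right, self.count = left, right, count

def insert(node, lo, hi, idx):
    # Persistent insert: return a new version with position idx added.
    if lo == hi:
        return Node(count=(node.count if node else 0) + 1)
    mid = (lo + hi) // 2
    left = node.left if node else None
    right = node.right if node else None
    if idx <= mid:
        left = insert(left, lo, mid, idx)
    else:
        right = insert(right, mid + 1, hi, idx)
    cnt = (left.count if left else 0) + (right.count if right else 0)
    return Node(left, right, cnt)

def kth(node, lo, hi, k):
    # k-th smallest position stored in this version (1-based), or None.
    if node is None or node.count < k:
        return None
    if lo == hi:
        return lo
    mid = (lo + hi) // 2
    lc = node.left.count if node.left else 0
    if k <= lc:
        return kth(node.left, lo, mid, k)
    return kth(node.right, mid + 1, hi, k - lc)

def preprocess(a):
    # versions[p] stores the indices of every element whose value is >= the
    # p-th smallest value, i.e. one saved tree per suffix of the sorted order.
    pairs = sorted((v, i) for i, v in enumerate(a))
    versions, tree = [None] * (len(a) + 1), None
    for p in range(len(a) - 1, -1, -1):
        tree = insert(tree, 0, len(a) - 1, pairs[p][1])
        versions[p] = tree
    return [v for v, _ in pairs], versions

def query(a, values, versions, L, K):
    p = bisect.bisect_left(values, L)  # first sorted value >= L
    idx = kth(versions[p], 0, len(a) - 1, K) if p < len(a) else None
    return a[idx] if idx is not None else None

For the example array, after values, versions = preprocess(A), the call query(A, values, versions, 19, 5) returns 49. Preprocessing is O(N log N) time and space, and each query is O(log N), matching the analysis above.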

Checking if two substrings overlap in O(n) time

Suppose I have a string S of length n, and a list of tuples (a, b), where a specifies the starting position of a substring of S and b is the length of the substring. To check whether any substrings overlap, we can, for example, mark each position in S whenever it's touched. However, I think this will take O(n^2) time if the list of tuples has size n (looping over the tuple list, then looping over S).
Is it possible to check whether any substring overlaps another in O(n) time?
Edit:
For example, S = "abcde" and tuples = [(1,2), (3,3), (4,2)], representing "ab", "cde" and "de". I want to know that an overlap is discovered when (4,2) is read.
I was thinking it is O(n^2) because for every tuple you need to loop through its substring of S to see whether any character is marked dirty.
Edit 2:
I cannot exit once a collision is detected. Imagine I need to report all the subsequent tuples that collide, so I have to loop through the whole tuple list.
Edit 3:
A high-level view of the algorithm:
for each tuple (a, b)
    for (int i = a; i < a + b; i++)      // positions a .. a+b-1
        if S[i] is dirty
            then report tuple and break  // break inner loop only
        else mark S[i] dirty
Your basic approach is correct, but you could optimize your stopping condition in a way that guarantees bounded complexity in the worst case. Think about it this way: how many positions in S would you have to traverse and mark in the worst case?
If there is no collision, then at worst you'll visit length(S) positions (and run out of tuples by then, since any additional tuple would have to collide). If there is a collision, you can stop at the first marked position, so again you're bounded by the maximum number of unmarked positions, which is length(S).
EDIT: since you added the requirement to report all colliding tuples, let's calculate this again (extending my comment).
Once you have marked all elements, you can detect a collision for every further tuple in a single step (O(1)), and therefore you need O(n + n) = O(n) overall.
Each step either marks an unmarked element (at most n of these in the worst case) or identifies a colliding tuple (at most O(tuples), which we assume is also n).
The actual steps may be interleaved, since the tuples may be arranged in any way without colliding at first, but once they do (after at most n tuples, which cover all n elements before colliding for the first time), you must collide on the first step every time. Other arrangements may collide earlier, even before all elements are marked, but again, you're just rearranging the same number of steps.
Worst case example: one tuple covering the entire array, then n-1 tuples (doesn't matter which) -
[(1,n), (n,1), (n-1,1), ...(1,1)]
The first tuple would take n steps to mark all elements; the rest would take O(1) each to finish, for O(2n) = O(n) overall. Now convince yourself that the following example takes the same number of steps:
[(1,n/2-1), (1,1), (2,1), (3,1), (n/2,n/2), (4,1), (5,1) ...(n,1)]
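
A short Python sketch of this bounded marking scheme (illustrative names; tuples use the question's 1-indexed (start, length) form). Every inner iteration either marks a fresh position (at most n times in total) or reports the current tuple and abandons it (at most once per tuple), so the total work is O(n + number of tuples):

def report_colliding_tuples(n, tuples):
    marked = [False] * n
    colliding = []
    for a, b in tuples:
        for i in range(a - 1, a - 1 + b):  # convert to 0-indexed positions
            if marked[i]:
                colliding.append((a, b))
                break                      # stop scanning this tuple early
            marked[i] = True
    return colliding

For the example, report_colliding_tuples(5, [(1,2), (3,3), (4,2)]) returns [(4, 2)].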
According to your description and comments, the overlap problem may not really be about string algorithms; it can be regarded as a "segment overlap" problem.
Using your example, it translates to 3 segments: [1, 2], [3, 5], [4, 5]. The question is whether any of the 3 segments overlap.
Suppose we have m segments, each with the format [start, end], meaning the segment's start and end positions. One efficient algorithm to detect overlap is to sort them by start position in ascending order, which takes O(m * lg m). Then iterate over the sorted m segments; for each segment, check whether its start position falls at or before the maximum end position seen so far:
if (start[i] <= maxEnd[i-1]) {
    // segment i overlaps
}
maxEnd[i] = max(maxEnd[i-1], end[i]); // update the max end position over segments 1 .. i
Each check operation takes O(1), so the total time complexity is O(m * lg m + m), which can be regarded as O(m * lg m). Reporting each individual overlap is additionally related to the tuple's length, which is bounded by n.
This is a segment overlap problem, and a solution is possible in O(n) itself if the list of tuples is already sorted in ascending order by the first field. Consider the following approach:
1. Transform the intervals from (start, number of characters) to (start, inclusive_end). The above example becomes: [(1,2), (3,3), (4,2)] ==> [(1, 2), (3, 5), (4, 5)].
2. The tuples are valid if consecutive transformed tuples (a, b) and (c, d) always satisfy b < c. Otherwise there is an overlap, as in the tuples above.
Both steps 1 and 2 can be done in O(n) if the array is sorted in the form mentioned above.
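
A compact Python sketch combining the transform with the consecutive-pair check (the sort is included for completeness; if the tuples already arrive sorted by start, the scan alone is O(m)):

def has_overlap(tuples):
    # Transform (start, length) -> (start, inclusive_end), then scan in
    # start order; when sorted by start, checking consecutive pairs suffices.
    intervals = sorted((a, a + b - 1) for a, b in tuples)
    prev_end = float("-inf")
    for start, end in intervals:
        if start <= prev_end:
            return True       # this segment begins before the previous ends
        prev_end = end
    return False

has_overlap([(1,2), (3,3), (4,2)]) returns True, since [3, 5] and [4, 5] share position 4.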

How to locate in a huge list of numbers, two numbers where xi=xj?

I have the following question, and it screams at me for a solution with hashing:
Problem:
Given a huge list of numbers x_1, ..., x_n, where each x_i <= T, we'd like to know
whether there exist two indices i, j where x_i == x_j.
Find an algorithm for the problem with expected O(n) run time.
My solution at the moment: we use hashing, with a hash function h(x) and chaining.
First, we build a new array, call it A, where each cell is a linked list; this is the destination array.
Now we run over all n numbers and map each element of x_1, ..., x_n to its place using the hash function. This takes O(n) run time.
After that we run over A and look for collisions. If we find a cell where length(A[k]) > 1,
then we return the x_i and x_j that were mapped to A[k]. The total run time here is O(n) in the worst case, even if the two equal numbers (if they indeed exist) are mapped to the last cell of A.
The same approach can be roughly twice as fast on average (still O(n) on average, but with better constants).
There is no need to map all the elements into the hash table and then go over it; a faster solution is:
for each element e:
    if e is in the table:
        return e
    else:
        insert e into the table
Also note that if T < n, there must be a duplicate within the first T+1 elements, by the pigeonhole principle.
Also, for small T, you can use a simple array of size T; no hash is needed (hash(x) = x). Initializing the array to zeros can even be done in O(1) with the lazy array-initialization trick, or simply in O(T), which is O(n) whenever T < n.
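
A minimal Python sketch of the single-pass version; the built-in set plays the role of the hash table with chaining, so each membership test and insert is expected O(1):

def find_duplicate(xs):
    seen = set()
    for x in xs:
        if x in seen:
            return x          # first value seen twice
        seen.add(x)
    return None               # all elements distinct

find_duplicate([3, 1, 4, 1, 5]) returns 1, stopping as soon as the repeat is seen rather than scanning the whole list.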

Growing arrays in Haskell

I have the following (imperative) algorithm that I want to implement in Haskell:
Given a sequence of pairs [(e0,s0), (e1,s1), (e2,s2), ..., (en,sn)], where both the "e" and "s" parts are natural numbers (not necessarily distinct), at each time step one element of this sequence is selected at random, say (ei,si), and based on the values of (ei,si), a new element is built and added to the sequence.
How can I implement this efficiently in Haskell? The need for random access makes lists a bad fit, while the need for appending one element at a time makes arrays a bad fit, as far as I know.
Thanks in advance.
I suggest using either Data.Set or Data.Sequence, depending on what you need it for. The latter in particular provides logarithmic index lookup (as opposed to linear for lists) and O(1) appending on either end.
"while the need for appending one element at a time would make it bad for arrays" Algorithmically, it seems like you want a dynamic array (aka vector, array list, etc.), which has amortized O(1) time to append an element. I don't know of a Haskell implementation of it off-hand, and it is not a very "functional" data structure, but it is definitely possible to implement it in Haskell in some kind of state monad.
If you know approximately how many total elements you will need, you can create an array of that size which is "sparse" at first and then fill in elements as needed.
Something like below can be used to represent this new array:
data MyArray = MyArray (Array Int Int) Int
(where the last Int represents how many elements of the array are in use)
If you really need stop-and-start resizing, you could think about using the simple-rope package along with a StringLike instance for something like Vector. In particular, this might accommodate scenarios where you start out with a large array and make relatively small additions.
That said, adding individual elements into the chunks of the rope may still induce a lot of copying. You will need to try out your specific case, but you should be prepared to use a mutable vector if you do not need pure intermediate results.
If you can build your array in one shot and just need the indexing behavior you describe, something like the following may suffice:
import Data.Array.IArray

test :: Array Int (Int,Int)
test = accumArray (flip const) (0,0) (0,20) [(i, f i) | i <- [0..19]]
  where f 0 = (1,0)
        f i = let (e,s) = test ! (i `div` 2) in (e*2, s+1)
Taking a note from ivanm, I think Sets are the way to go for this.
import Data.Set as Set
import System.Random (RandomGen, getStdGen)

startSet :: Set (Int, Int)
startSet = Set.fromList [(1,2), (3,4)] -- etc. Whatever the initial set is

-- Grow the set by randomly producing "n" elements.
growSet :: (RandomGen g) => g -> Set (Int, Int) -> Int -> (Set (Int, Int), g)
growSet g s n | n <= 0    = (s, g)
              | otherwise = growSet g'' s' (n-1)
  where s' = Set.insert (x,y) s
        ((x,_), g')  = randElem s g
        ((_,y), g'') = randElem s g'

randElem :: (RandomGen g) => Set a -> g -> (a, g)
randElem = undefined

main = do
  g <- getStdGen
  let (grownSet, _) = growSet g startSet 2
  print $ grownSet -- or whatever you want to do with it
This assumes that randElem is an efficient, definable method for selecting a random element from a Set (I asked this SO question regarding efficient implementations of such a method). One thing I realized upon writing up this implementation is that it may not suit your needs, since Sets cannot contain duplicate elements, and my algorithm has no way to give extra weight to pairings that appear multiple times in the list.
