Why is the complexity of merging M sorted arrays linear time?

Suppose we want to perform an external sort and have M sorted blocks, where each block contains k comparable items, so that n = Mk. Here k is also the maximum number of items that fit into memory for sorting, and n is the total number of items to sort.
Then, using the merge function from merge sort, each element has to be compared against the elements from the other blocks, which gives O(M) comparisons for one element. Since we have to do this for all elements, we get O(M * Mk) = O(M^2 * k) = O(nM) time complexity.
This seems linear at first, but suppose that in the worst case we can only fit one item into memory. Then we have M = n blocks, and the time complexity is O(n^2) directly. How does the merging give you linear time in external sort?
Also, in the case where k = 1, how is the sorting even feasible when no comparisons can be done in memory?

Make it priority-queue based: for example, build a binary heap, fill it with the current items (or their indexes) from every block, and extract the top item at every step.
Extracting takes O(log M) per output element, so the full merge is O(n log M).
For your artificial example (M = n), that is O(n log n).
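A minimal sketch of that heap-based M-way merge, assuming the sorted blocks are plain in-memory lists (in a real external sort they would be streamed from disk); Python's heapq.merge does essentially the same thing:

import heapq

def merge_blocks(blocks):
    # seed the heap with the first (smallest) item of every non-empty block
    heap = [(block[0], b, 0) for b, block in enumerate(blocks) if block]
    heapq.heapify(heap)                        # O(M)
    result = []
    while heap:
        value, b, i = heapq.heappop(heap)      # O(log M) per output element
        result.append(value)
        if i + 1 < len(blocks[b]):             # refill from the block we popped from
            heapq.heappush(heap, (blocks[b][i + 1], b, i + 1))
    return result

merge_blocks([[1, 4, 7], [2, 5, 8], [3, 6, 9]])  # -> [1, 2, ..., 9]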

Related

Find Number of pairs (a, b) in an array such that a % b = 0

I am getting a solution in O(N^2). Here is my solution:
arr = [4, 6, 5, 5, 7, 7, 8]
n = len(arr)
count = 0
for i in range(n):
    for j in range(n):
        if i != j and arr[i] % arr[j] == 0:
            count = count + 1
print(count)
You have to check each item against another item, so the general code will be O(n^2).
You might be able to make it more efficient with knowledge of the structure of the problem.
For example if you have repeated values you can create a unique list first: this can have complexity O(n).
It might also be worth sorting the list so that you only divide larger numbers by smaller ones; sorting averages O(n log n) (and is at worst O(n^2) for some algorithms), so it does not change the overall bound.
You might also be able to use clever mathematical properties of the number ranges to exclude a large part of the search (once the data has been sorted).
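One concrete instance of using the structure of the values (a hedged sketch, assuming positive integers): count the frequency of every value once, then for each distinct value walk over its multiples up to the maximum. The harmonic-series sum over the multiples makes this roughly O(n + V log V), where V is the largest value:

from collections import Counter

def count_divisible_pairs(arr):
    freq = Counter(arr)
    max_val = max(arr)
    count = 0
    for b, fb in freq.items():
        count += fb * (fb - 1)                  # equal-value pairs with i != j
        for a in range(2 * b, max_val + 1, b):  # proper multiples of b
            count += freq.get(a, 0) * fb        # every such a is divisible by b
    return count

count_divisible_pairs([4, 6, 5, 5, 7, 7, 8])  # matches the O(N^2) version above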

Efficient way to search within unsorted array

I have an unsorted array A containing values in the range 0 to 100. I have multiple queries of the form QUERY(starting array index, ending array index, startValue, endValue). I want to return the array of indexes (within the given index range) whose values lie between startValue and endValue. The naive approach takes O(n) time per query, and I need a more efficient algorithm. Also, the queries are not known in advance.
There are some tradeoffs in terms of memory usage, preprocessing time and query time. Let h be the range of possible values (101 in your case). Ideally you would like your queries to take O(m) time, where m is the number of indexes returned. Here are some approaches.
2-d trees. Each array element V[x] = y corresponds to a 2-d point (x, y). Each query (start, end, min, max) corresponds to a range query in the 2-d tree between those boundaries. This implementation needs O(n) memory, O(n log n) preprocessing time and O(sqrt n + m) time per query (see the complexity section). Notably, this does not depend on h.
Sorted arrays + min-heap (Arguably an easier implementation if you roll your own).
Build h sorted arrays P_0, ..., P_(h-1), where P_k is the array of positions at which the value k occurs in the original array. This takes O(n) memory and O(n) preprocessing time.
Now we can answer, in O(log n) time using binary search, queries of the form next(pos, k): "starting at position pos, where does the next occurrence of the value k appear?"
To answer a query (start, end, min, max), begin by collecting next(start, min), next(start, min + 1), ..., next(start, max) and build a min-heap with them. This takes O(h log n) time. Then, while the minimum of the heap is at most end, remove it from the heap, add it to the list of indices to return, and add in its place the next element from its corresponding P array. This yields a complexity of O(h log n + m log h) per query.
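A rough sketch of this sorted-arrays + min-heap approach, assuming values 0..100 (h = 101); preprocess and query are just illustrative names:

import bisect
import heapq

def preprocess(A, h=101):
    P = [[] for _ in range(h)]
    for pos, v in enumerate(A):
        P[v].append(pos)                        # positions are appended in increasing order
    return P

def query(P, start, end, min_val, max_val):
    heap = []
    for k in range(min_val, max_val + 1):       # O(h log n): seed with next(start, k)
        i = bisect.bisect_left(P[k], start)
        if i < len(P[k]):
            heap.append((P[k][i], k, i))
    heapq.heapify(heap)
    result = []
    while heap and heap[0][0] <= end:           # O(m log h): pop every matching index
        pos, k, i = heapq.heappop(heap)
        result.append(pos)
        if i + 1 < len(P[k]):
            heapq.heappush(heap, (P[k][i + 1], k, i + 1))
    return result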
I have two more ideas based on the linearithmic approach to range minimum queries, but they require O(nh) and O(nh log h) space respectively. The query time is improved to O(m). If that is not prohibitive, please let me know and I will edit the answer to elaborate.

Time Complexity of Insertion and Selection sort When there are only two key values in an array

I have been reviewing Algorithms, 4th Edition by Sedgewick recently, and came across a problem I cannot solve.
The problem goes like this:
2.1.28 Equal keys. Formulate and validate hypotheses about the running time of insertion
sort and selection sort for arrays that contain just two key values, assuming that
the values are equally likely to occur.
Explanation: You have n elements, each can be 0 or 1 (without loss of generality), and for each element x: P(x=0)=P(x=1).
Any help will be welcomed.
Selection sort:
The time complexity is going to remain the same as it is without the two-key assumption: it is independent of the values in the array and depends only on the number of elements.
Time complexity for selection sort in this case is O(n^2).
However, this is true only for the original algorithm, which scans the entire tail of the array in each outer-loop iteration. If you optimize it to stop at the next "0": at iteration i, since you have already "cleared" the first i-1 zeros, the expected location of the i-th zero is index 2i. This means the inner loop needs about 2i - (i-1) = i+1 iterations each time.
Summing it up gives:
1 + 2 + ... + n = n(n+1)/2
which is, unfortunately, still O(n^2).
Another optimization could be to "remember" where you last stopped. This improves the complexity to O(n), since you never traverse the same element more than once - but that's a different algorithm, not selection sort (see the sketch below).
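A minimal sketch of that last idea for a 0/1 array - effectively a one-pass partition rather than selection sort, since neither pointer ever moves backwards:

def sort_binary(a):
    fill = 0                        # next slot that should receive a 0
    for scan in range(len(a)):      # scan never moves backwards
        if a[scan] == 0:
            a[fill], a[scan] = a[scan], a[fill]
            fill += 1
    return a

sort_binary([1, 0, 1, 1, 0, 0])  # -> [0, 0, 0, 1, 1, 1]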
Insertion Sort:
Here, things are more complicated. Note that in the inner loop (taken from the Wikipedia pseudocode), the number of operations depends on the values:
while j > 0 and A[j-1] > x
However, recall that in insertion sort, after the i-th step the first i elements are sorted. Since we assume P(x=0) = P(x=1), on average i/2 of them are 0's and i/2 are 1's.
This means that, on average, the inner loop does O(i/2) work.
Summing this up gives:
1/2 + 2/2 + 3/2 + ... + n/2 = 1/2 * (1 + 2 + ... + n) = 1/2 * n(n+1)/2 = n(n+1)/4
This is, however, still O(n^2).
The above is not a formal proof, because it implicitly uses E(f(E(x))) = E(f(x)), which is not true in general, but it can give you guidelines for how to build the proof formally.
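Since the exercise asks you to formulate and validate a hypothesis, a quick empirical check can help. Here is a rough sketch that counts the inner-loop comparisons of insertion sort on random 0/1 arrays; if the analysis above is right, doubling n should roughly quadruple the count:

import random

def insertion_sort_comparisons(a):
    a = list(a)
    comparisons = 0
    for i in range(1, len(a)):
        x = a[i]
        j = i
        while j > 0:
            comparisons += 1        # one evaluation of A[j-1] > x
            if a[j - 1] <= x:
                break
            a[j] = a[j - 1]         # shift the larger element to the right
            j -= 1
        a[j] = x
    return comparisons

for n in (1000, 2000, 4000):
    arr = [random.randint(0, 1) for _ in range(n)]
    print(n, insertion_sort_comparisons(arr))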
Well, obviously you only need to search until you find the first 0 when looking for the next smallest element. For example, in selection sort you scan the array looking for the next smallest number to swap into the current position. Since there are only 0s and 1s, you can stop the scan when you encounter the first 0 (it is the next smallest number), so there is no need to scan the rest of the array in this cycle. If no 0 is found then the sorting is complete, since the "unsorted" portion is all 1s.
Insertion sort is basically the same. They are both O(N) in this case.

Find the tightest upper bound for a creation of an array (among few options)

Let A be an array of n numbers. We need to create an array B such that B[i] = min(A[i], ..., A[i+sqrt(n)]).
Why is the tightest upper bound for the creation of B O(nlogn)?
I was actually given a list of options:
O(sqrt(n)*logn)
O(n/logn)
O(n*logn)
O(nlog(n)^2)
O(n*sqrt(n))
O(n^2)
The answer is O(nlogn), since it is the lowest option that is not sublinear.
It can be done in O(nlogn) by maintaining a sorted data structure (a self-balancing BST, for example) of size sqrt(n), and iteratively removing and adding elements to it while running a sliding window over the array.
Each step costs O(log(sqrt(n))) = O(1/2 * log(n)) = O(logn), and there are O(n) steps, so the total is O(nlogn).
This disqualifies all "higher" alternatives; all "lower" alternatives are sublinear, and you cannot create the array in sublinear time.
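A minimal sketch of that sliding-window idea, using sortedcontainers.SortedList as a stand-in for a self-balancing BST (an assumption - any ordered multiset with logarithmic insert, delete and min would do):

from math import isqrt
from sortedcontainers import SortedList

def build_B(A):
    n = len(A)
    w = isqrt(n)                                 # window covers A[i] .. A[i + sqrt(n)]
    window = SortedList(A[:min(w + 1, n)])       # initial window A[0..w]
    B = []
    for i in range(n):
        B.append(window[0])                      # minimum of the current window
        window.remove(A[i])                      # A[i] leaves the window, O(log sqrt(n))
        if i + w + 1 < n:
            window.add(A[i + w + 1])             # next element enters, O(log sqrt(n))
    return B

Near the end of the array the window simply shrinks, so B[i] becomes the minimum of whatever remains of A[i..i+sqrt(n)].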

How to locate in a huge list of numbers, two numbers where xi=xj?

I have the following question, and it screams for a solution with hashing:
Problem:
Given a huge list of numbers, x1, ..., xn, where xi <= T, we'd like to know
whether or not there exist two indices i, j with x_i == x_j.
Find an algorithm for the problem with O(n) running time, in expectation.
My solution at the moment: we use hashing, with a mapping function h(x) and chaining.
First, we build a new array, call it A, where each cell is a linked list - this is the destination array.
Now we go over all n numbers and map each element of x1, ..., xn to its place in A using the hash function. This takes O(n) time.
After that we scan A and look for collisions. If we find a cell with length(A[k]) > 1,
we return the xi and xj that were mapped into A[k]. The total running time here is O(n) even in the worst case, i.e. when the two equal numbers (if they indeed exist) are mapped to the last cell of A.
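A rough sketch of that chaining approach, using a Python dict of lists as the bucket array and the built-in hash as the mapping function (find_equal_pair is just an illustrative name); note that a chain longer than 1 can still hold distinct values, so we compare within each chain:

from collections import defaultdict

def find_equal_pair(xs):
    buckets = defaultdict(list)                 # each bucket is a chain of (index, value)
    for i, x in enumerate(xs):                  # O(n): map every element to a bucket
        buckets[hash(x) % len(xs)].append((i, x))
    for chain in buckets.values():              # O(n) total work over all chains
        seen = {}
        for i, x in chain:
            if x in seen:
                return seen[x], i               # indices i, j with x_i == x_j
            seen[x] = i
    return None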
The same approach can be made ~twice as fast (on average) - still O(n) on average, but with better constants.
There is no need to map all the elements into the hash table and then go over it - a faster solution could be:
def find_dupe(xs):
    table = set()
    for e in xs:
        if e in table:
            return e
        table.add(e)
    return None
Also note that if T < n, there must be a dupe within the first T+1 elements, by the pigeonhole principle.
Also, for small T you can use a simple array of size T instead of a hash table (effectively hash(x) = x); initializing the array with zeros takes only O(T) time.
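A tiny sketch of that direct-addressing variant, assuming the values are non-negative integers bounded by T:

def find_dupe_bounded(xs, T):
    seen = [False] * (T + 1)        # direct addressing: hash(x) = x
    for e in xs:
        if seen[e]:
            return e
        seen[e] = True
    return None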
