The number of distinct integers in an array is O(log n). How to get an O(n log log n) worst-case time algorithm to sort such sequences? - arrays

The question is from The Algorithm Design Manual. I have been working on it but haven't found a method to arrive at the right answer.
Question:
We seek to sort a sequence S of n integers with many duplications, such that the number of distinct integers in S is O(log n). Give an O(n log log n) worst-case time algorithm to sort such sequences.
I think maybe I can first pick out all the distinct elements to form an array of length log n, then record the frequencies and sort it. However, my first step seems to blow up the running time too much... Is there a superior selection method, or is my method totally wrong? Thanks

Use a balanced binary tree to count the number of occurrences of each number. Since there are only O(log N) distinct numbers, the size of the tree is O(log N), and thus every operation is performed in O(log log N) time. (This is exactly how map<> is implemented in C++.)
Then, just iterate over the nodes of the tree in an in-order traversal, and print each integer the required number of times in this order.
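A minimal C++ sketch of this idea, assuming int values (the function name is my own). std::map is a balanced binary tree, so with only O(log n) distinct keys each operation costs O(log log n), and in-order iteration yields the keys in sorted order:

```cpp
#include <map>
#include <vector>
using namespace std;

// Sort an array with few distinct values by counting occurrences in a
// balanced tree, then replaying each value the required number of times.
void sortViaMapCounts(vector<int>& a) {
    map<int, int> freq;                 // value -> number of occurrences
    for (int v : a) ++freq[v];          // n operations, O(log log n) each
    size_t out = 0;
    for (auto& [value, count] : freq)   // in-order: keys come out sorted
        for (int c = 0; c < count; ++c)
            a[out++] = value;
}
```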

Create an array containing pairs of (unique numbers, count). The array is initially empty and kept sorted.
For each number in your original array, look it up in the sorted array using binary search. Since the array has size O (log N), each binary search takes O (log log N); you do that N times, for a total of O (N log log N). When found, you increase the count.
When not found, you insert the new number with a count of 1. This operation only happens O (log N) times, and is trivially done in O (log N) steps, for a total of O (log^2 N), which is much smaller than O (N log log N).
When you are done, fill the original array with the required numbers. That takes O (N).
There's really no need to create a balanced sorted tree to make the insertions faster, because the set of unique numbers is so small.
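A sketch of the sorted (value, count) array approach described above, assuming int values (all names are my own):

```cpp
#include <algorithm>
#include <vector>
using namespace std;

// Sort an array with few distinct values using a small sorted array of
// (value, count) pairs instead of a tree.
void sortViaSortedPairs(vector<int>& a) {
    vector<pair<int, int>> counts;         // kept sorted by value
    for (int v : a) {
        // binary search over O(log n) entries: O(log log n) per element
        auto it = lower_bound(counts.begin(), counts.end(),
                              make_pair(v, 0));
        if (it != counts.end() && it->first == v)
            ++it->second;
        else
            counts.insert(it, {v, 1});     // rare: only O(log n) inserts
    }
    size_t out = 0;
    for (auto& [value, count] : counts)
        for (int c = 0; c < count; ++c)
            a[out++] = value;
}
```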
If the set of integers is all contained in a range X ≤ number ≤ Y, then the problem can be solved in O (max (N, Y - X + 1)) using an array of Y - X + 1 counters, without even bothering to find the unique numbers.
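The bounded-range variant is a plain counting sort; a sketch (function name and signature are my own):

```cpp
#include <vector>
using namespace std;

// Counting sort for values known to lie in [X, Y]: one counting pass over
// the input, one output pass over the Y - X + 1 counters.
void countingSort(vector<int>& a, int X, int Y) {
    vector<int> cnt(Y - X + 1, 0);       // Y - X + 1 counters
    for (int v : a) ++cnt[v - X];        // O(N) counting pass
    size_t out = 0;
    for (int v = X; v <= Y; ++v)         // O(Y - X + 1) output pass
        for (int c = 0; c < cnt[v - X]; ++c)
            a[out++] = v;
}
```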

Related

How are big O notation equations assigned in search algorithms?

For linear search it makes sense that the run time is big O of N, since in the worst case it checks every element, one step at a time. As for bubble sort, my understanding is that its runtime is O of n^2; this makes sense to me because you'd iterate over the number of elements in the array and each time compare two values, until the end of said array.
But merge sort is always splitting the data in half, so I'm confused about the explanation of why its run time is n log n. Additionally I want to check my understanding of insertion sort's runtime being big O of n^2. Since insertion sort looks for the smallest number and then compares it to every single number of the array, it would be n^2 because it loops through the array contents for every iteration.
If I could be given some advice about merge sort, and a general understanding of run times, that'd be appreciated. I am an absolute newbie and wanted to throw that disclaimer out there.
Let's assume that sorting of an array of N elements is taking T(N) time. In merge sort we know that we need to sort two arrays of N/2 elements (that is 2*T(N/2)) and then merge them (in O(N) time complexity, that is c*N for some constant c).
So, T(N) = 2T(N/2) + c*N.
We could stop here, as this is basically the "equation" you are asking about. But let's go a bit further.
To simplify things, we can show that T(N) = kN log N as follows (for some constant k):
Let's substitute T on both sides of the equation we have derived:
kN log N = 2 * k * (N/2) * log(N/2) + c*N
and expand the right-hand side (assuming log base 2):
= k*N*(log N - log 2) + c*N = k*N*(log N - 1) + c*N = k*N*log N + (c - k)*N
So for k = c the equality holds, which shows that T(N) is of the form kN log N, that is, O(N log N).
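A standard merge sort sketch that matches the recurrence term by term: the two recursive calls are the 2T(N/2), and the linear merge loop is the c*N:

```cpp
#include <algorithm>
#include <vector>
using namespace std;

// Merge sort on the half-open range [lo, hi) of a.
void mergeSort(vector<int>& a, size_t lo, size_t hi) {
    if (hi - lo < 2) return;             // base case: 0 or 1 element
    size_t mid = lo + (hi - lo) / 2;
    mergeSort(a, lo, mid);               // T(N/2)
    mergeSort(a, mid, hi);               // T(N/2)
    vector<int> merged;                  // c*N merge step
    merged.reserve(hi - lo);
    size_t i = lo, j = mid;
    while (i < mid && j < hi)
        merged.push_back(a[i] <= a[j] ? a[i++] : a[j++]);
    while (i < mid) merged.push_back(a[i++]);
    while (j < hi)  merged.push_back(a[j++]);
    copy(merged.begin(), merged.end(), a.begin() + lo);
}
```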

Efficient way to search within unsorted array

I have an unsorted array A containing values within the range 0 to 100. I have multiple queries of the format QUERY(starting array index, ending array index, startValue, endValue). I want to return an array of the indexes whose values lie between startValue and endValue. The naive approach takes O(n) time for each query, and I need a more efficient algorithm. Also, the queries are not known initially.
There are some tradeoffs in terms of memory usage, preprocessing time and query time. Let h be the range of possible values (101 in your case). Ideally you would like your queries to take O(m) time, where m is the number of indexes returned. Here are some approaches.
2-d trees. Each array element V[x] = y corresponds to a 2-d point (x, y). Each query (start, end, min, max) corresponds to a range query in the 2-d tree between those boundaries. This implementation needs O(n) memory, O(n log n) preprocessing time and O(sqrt n + m) time per query (see the complexity section). Notably, this does not depend on h.
Sorted arrays + min-heap (Arguably an easier implementation if you roll your own).
Build h sorted arrays P0, ..., P(h-1), where Pk is the array of positions where the value k occurs in the original array. This takes O(n) memory and O(n) preprocessing time.
Now we can answer in O(log n) (using binary search) queries of the form next(pos, k): "starting at position pos, where does the next value of k occur?"
To answer a query (start, end, min, max), begin by collecting next(start, min), next(start, min + 1), ..., next(start, max) and build a min-heap with them. This takes O(h log n) time. Then, while the minimum of the heap is at most end, remove it from the heap, add it to the list of indices to return, and add in its place the next element from its corresponding P array. This yields a complexity of O(h log n + m log h) per query.
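The preprocessing and heap-based query above can be sketched as follows, assuming values in [0, h) and inclusive index bounds (all names are my own):

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

// P[k] lists the positions where value k occurs, in increasing order.
vector<vector<int>> preprocess(const vector<int>& a, int h) {
    vector<vector<int>> P(h);
    for (int i = 0; i < (int)a.size(); ++i)
        P[a[i]].push_back(i);
    return P;
}

// All indices in [start, end] whose value lies in [minV, maxV],
// reported in increasing order of position.
vector<int> query(const vector<vector<int>>& P,
                  int start, int end, int minV, int maxV) {
    using Item = pair<int, int>;                  // (position, value)
    priority_queue<Item, vector<Item>, greater<Item>> heap;
    vector<size_t> cursor(maxV - minV + 1);
    for (int k = minV; k <= maxV; ++k) {
        // next(start, k) via binary search in P[k]
        auto it = lower_bound(P[k].begin(), P[k].end(), start);
        cursor[k - minV] = it - P[k].begin();
        if (it != P[k].end()) heap.push({*it, k});
    }
    vector<int> out;
    while (!heap.empty() && heap.top().first <= end) {
        auto [pos, k] = heap.top();
        heap.pop();
        out.push_back(pos);
        size_t& c = cursor[k - minV];             // advance within P[k]
        if (++c < P[k].size()) heap.push({(int)P[k][c], k});
    }
    return out;
}
```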
I have two more ideas based on the linearithmic approach to range minimum queries, but they require O(nh) and O(nh log h) space respectively. The query time is improved to O(m). If that is not prohibitive, please let me know and I will edit the answer to elaborate.

find nth-smallest value across m sorted arrays using idea from 2 sorted arrays

May I ask whether it would be possible? The general approach would be somewhat like finding the n-th value in two sorted arrays: ignore the insignificant parts and try to focus on the rest by adjusting the value of n in the recursion.
The two sorted arrays problem can be solved in O(log(|A|) + log(|B|)) time. Since this question is similar, I would like to ask if there exists an algorithm for m sorted arrays with time O(log(|A1|) + log(|A2|) + ... + log(|Am|)),
or some similar variation near the time mentioned above (due to the variable m, we might need some other sorting algorithm for the pivots from those arrays),
or, if such an algorithm doesn't exist, why?
I just can't find this algorithm by googling.
There is a simple randomized algorithm:
Select a pivot randomly from any of the m arrays. Let's call it x
For every array, do a binary search for x to find out how many values < x are in the array. Say we have ri values < x in array i. We know that x has rank r = sum(i = 1 to m, ri) in the union of all arrays.
If n <= r, we restrict each array i to the indices 0...(ri - 1) and recurse. If n > r, we restrict each array to the indices ri...(|Ai| - 1) and recurse on n - r.
repeat
The expected recursion depth is O(log(N)), with N being the total number of elements, by a proof similar to that of Quickselect, so the expected running time is something like O(m * log^2(N)).
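A sketch of the randomized algorithm, with one small tweak for robustness: counting both elements < x and elements <= x guarantees progress when there are duplicate values (all names are my own):

```cpp
#include <algorithm>
#include <random>
#include <vector>
using namespace std;

// n-th smallest (1-based) element in the union of m sorted arrays.
int nthSmallest(const vector<vector<int>>& A, size_t n) {
    size_t m = A.size();
    vector<size_t> lo(m, 0), hi(m);          // active slice of each array
    for (size_t i = 0; i < m; ++i) hi[i] = A[i].size();
    mt19937 rng(12345);
    for (;;) {
        // pick a random pivot x among the remaining elements
        size_t total = 0;
        for (size_t i = 0; i < m; ++i) total += hi[i] - lo[i];
        uniform_int_distribution<size_t> dist(0, total - 1);
        size_t idx = dist(rng);
        int x = 0;
        for (size_t i = 0; i < m; ++i) {
            size_t len = hi[i] - lo[i];
            if (idx < len) { x = A[i][lo[i] + idx]; break; }
            idx -= len;
        }
        // binary-search the rank of x in every array
        size_t below = 0, upto = 0;          // #elements < x, #elements <= x
        vector<size_t> lt(m), le(m);
        for (size_t i = 0; i < m; ++i) {
            auto b = A[i].begin() + lo[i], e = A[i].begin() + hi[i];
            lt[i] = lower_bound(b, e, x) - b;
            le[i] = upper_bound(b, e, x) - b;
            below += lt[i];
            upto  += le[i];
        }
        if (n <= below) {                    // answer is < x: keep left parts
            for (size_t i = 0; i < m; ++i) hi[i] = lo[i] + lt[i];
        } else if (n <= upto) {              // rank n falls on x itself
            return x;
        } else {                             // answer is > x: keep right parts
            for (size_t i = 0; i < m; ++i) lo[i] += le[i];
            n -= upto;
        }
    }
}
```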
The paper "Generalized Selection and Ranking" by Frederickson and Johnson proposes selection and ranking algorithms for different scenarios, for example an O(m + c * log(k/c)) algorithm to select the k-th element from m equally sized sorted sequences, with c = min{m, k}.

finding the maximum number in array

There is an array of numbers, and this array is unsorted. We should find the maximum number n such that at least n numbers in the array are bigger than it (this number may or may not be in the array).
For example, given 2 5 7 6 9, the number 4 is the maximum number such that at least 4 numbers (or more) are bigger than 4 (5, 6, 7, 9 are bigger).
I solved this problem, but I think it will exceed the time limit on big arrays of numbers, so I want to solve this problem in another way.
So I use merge sort for the sorting, because it takes n log(n), and then I use a counter that counts from 1 to k: while we have at least k numbers bigger than k, we keep counting. For example, we count from 1 to 4; then at 5 we don't have 5 numbers bigger than 5, so we return k - 1 = 4, and this is our n.
Is this good, or might it exceed the time limit? Does anybody have another idea?
Thanks
In C++ there is a function called std::nth_element that can find the n-th element of an array in linear time. Using this function you should find the (N - n)-th element (where N is the total number of elements in the array) and subtract 1 from it.
As you seek a solution in C, you can not make use of this function, but you can implement your solution similarly. nth_element performs something quite similar to qsort, but it only performs the partition step on the part of the array where the n-th element is.
Now let's assume you have nth_element implemented. We will perform a combination of binary search and nth_element. First we assume that the answer is the middle element of the array (i.e. the N/2-th element). We use nth_element to find the N/2-th element. If it is more than N/2, we know the answer to your problem is at least N/2; otherwise it will be less. Either way, in order to find the answer we only need to continue with one of the two partitions created by the N/2-th element. If this partition is the right one (elements bigger than N/2) we continue solving the same problem; otherwise we start searching for the maximum element M to the left of the N/2-th element that has at least x bigger elements, such that x + N/2 > M. The two subproblems have the same complexity. You continue performing this operation until the interval you are interested in is of length 1.
Now let's prove that the complexity of the above algorithm is linear. The first nth_element is linear, performing on the order of N operations; the second nth_element, which only considers one half of the array, performs on the order of N/2 operations; the third on the order of N/4, and so on. All in all you will perform on the order of N + N/2 + N/4 + ... + 1 operations. This sum is less than 2 * N, thus your complexity is still linear.
Your solution is asymptotically slower than what I propose above, as it has complexity O(n*log(n)) while my solution has complexity O(n).
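A simplified sketch of the nth_element idea (all names are my own). Running nth_element on the whole array inside a binary search over the answer gives O(N log N); the refinement described above, recursing into only one partition, is what brings it down to O(N). The key check: "at least k elements are bigger than k" holds exactly when the k-th largest element is > k.

```cpp
#include <algorithm>
#include <functional>
#include <vector>
using namespace std;

// Does the array contain at least k elements bigger than k?
// std::nth_element finds the k-th largest in O(N); it reorders the
// array, so we take a copy.
bool atLeastKBigger(vector<int> a, size_t k) {
    nth_element(a.begin(), a.begin() + (k - 1), a.end(), greater<int>());
    return a[k - 1] > (int)k;
}

// Binary search for the largest k with the property. Each check is O(N),
// so this sketch is O(N log N) overall.
size_t solve(const vector<int>& a) {
    size_t lo = 0, hi = a.size();        // the property holds trivially for k = 0
    while (lo < hi) {
        size_t mid = (lo + hi + 1) / 2;  // candidate k in [lo + 1, hi]
        if (atLeastKBigger(a, mid)) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}
```

For the example 2 5 7 6 9, the 4th largest element is 5 > 4, while the 5th largest is 2, so the answer is 4.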
I would use a modified variant of a sorting algorithm that uses pivot values.
The reason is that you want to sort as few elements as possible.
So I would use qsort as my base algorithm and let the pivot element control which partition to sort (you will only need to sort one).

How to locate in a huge list of numbers, two numbers where xi=xj?

I have the following question, and it screams at me for a solution with hashing:
Problem :
Given a huge list of numbers, x1, ..., xn, where xi <= T, we'd like to know
whether or not there exist two distinct indices i, j such that x_i == x_j.
Find an algorithm for the problem with O(n) running time in expectation.
My solution at the moment: we use hashing, with a hash function h(x) and chaining.
First - we build a new array, let's call it A, where each cell is a linked list - this would be the destination array.
Now - we run over all n numbers and map each element of x1, ..., xn to its rightful place using the hash function. This takes O(n) run time.
After that we run over A and look for collisions. If we find a cell where length(A[k]) > 1,
then we return the xi and xj that were mapped to the value stored in A[k]. The total run time here is O(n) in the worst case, which occurs if the mapped values of the two equal numbers (if they indeed exist) land in the last cell of A.
The same approach can be about twice as fast (on average), still O(n) on average, but with better constants.
No need to map all the elements into the hash and then go over it - a faster solution could be:
for each element e:
    if e is in the table:
        return e
    else:
        insert e into the table
Also note that if T < n, there must be a duplicate within the first T + 1 elements, by the pigeonhole principle.
Also, for small T you can use a simple array of size T; no hash is needed (hash(x) = x). Initializing the array to contain zeros can even be done in O(1) using the classic lazy-initialization trick, though a plain O(T) pass is usually fine.
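The early-exit loop above can be sketched with std::unordered_set, whose insert() reports whether the element was already present (function name is my own):

```cpp
#include <optional>
#include <unordered_set>
#include <vector>
using namespace std;

// Return the first value that repeats, or nullopt if all are distinct.
// Expected O(n) time, since each hash-set insert is O(1) on average.
optional<int> findDupe(const vector<int>& xs) {
    unordered_set<int> seen;
    for (int e : xs) {
        if (!seen.insert(e).second)   // insert() tells us whether e was new
            return e;                 // e appeared before: duplicate found
    }
    return nullopt;                   // all values distinct
}
```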
