sorting algorithm with pagination - c

I want to sort a list of entries and then select a subset (page) of that sorted list. For example: I have 10,000 items and want items 101 through 200.
A naive approach would be to first sort all 10,000 items and then select the page; that would mean items 1-100 and 201-10,000 are all unnecessarily fully sorted.
Is there an existing algorithm that fully sorts only the items in the page and stops sorting an entry further once it is clear it is not in the page? Source code in C would be great, but a description would also be OK.

Suppose you want items p through q out of n. While sorting everything would cost O(n·log n) time, the operation you mention can be done in O(n) time (so long as q−p ≪ n) as follows: apply an O(n)-time selection method to find the pᵗʰ and qᵗʰ smallest values. Then keep only the items whose values lie between those two, in time O(n+k) where k = q−p, i.e. about O(n), and sort those k items in time O(k·log k), which is about O(1) if k is O(1), for a net time of O(n).
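A minimal C sketch of this idea (my own illustrative code, not from the answer; sort_page and quickselect are made-up names, ranks p and q are 1-based): run an in-place selection twice so that the page occupies positions p-1..q-1, then sort only that slice with qsort.

#include <stdlib.h>

/* illustrative sketch, not from the answer */
static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* After the call, v[k] holds the (k+1)-th smallest element of v[0..n-1];
   everything left of index k is <= v[k], everything right of it is >= v[k].
   Expected O(n) time (classic Hoare/Wirth selection). */
static void quickselect(int *v, int n, int k)
{
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int pivot = v[k];
        int i = lo, j = hi;
        do {
            while (v[i] < pivot) i++;
            while (v[j] > pivot) j--;
            if (i <= j) { swap_int(&v[i], &v[j]); i++; j--; }
        } while (i <= j);
        if (j < k) lo = i;
        if (k < i) hi = j;
    }
}

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort only the page covering ranks p..q (1-based, inclusive), in place. */
void sort_page(int *v, int n, int p, int q)
{
    quickselect(v, n, p - 1);                     /* p-th smallest lands at index p-1 */
    quickselect(v + (p - 1), n - (p - 1), q - p); /* q-th smallest lands at index q-1 */
    qsort(v + (p - 1), q - p + 1, sizeof *v, cmp_int);  /* sort just the page */
}

For the question's example, sort_page(items, 10000, 101, 200) would leave items[100..199] holding entries 101 through 200 in sorted order, in expected O(n + k·log k) time.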

Suppose the page you want starts with the nth "smallest" element (or largest or whatever ordinal scale you prefer). Then you need to divide your partial sorting algorithm into two steps:
Find the nth element
Sort elements {n, n+1, ..., n+s} (where s is the page size)
Quicksort is a sorting algorithm that can be conveniently modified to suit your needs. Basically, it works as follows:
Given: a list L of ordinally related elements.
If L contains exactly one element, return L.
Choose a pivot element p from L at random.
Divide L into two sets: A and B such that A contains all the elements from L which are smaller than p and B contains all the elements from L which are larger.
Apply the algorithm recursively to A and B to obtain the sorted sublists A' and B'.
Return the list A || p || B, where || denotes appending lists or elements.
What you want to do in step #1 is run Quicksort until you've found the nth element. So step #1 will look like this:
Given: a list L of ordinally related elements, a page offset n and a page size s.
Choose a pivot element p from L at random.
Divide L into A and B.
If the size of A, #A = n-1, then return p || B.
If #A < n-1, then apply the algorithm recursively for L' = B and n' = n - #A - 1 (the pivot and all of A are smaller than the element we are looking for)
If #A > n-1, then apply the algorithm recursively for L' = A and n' = n
This step returns an unsorted list that starts with the nth element (the pivot) followed by everything larger. Next, run Quicksort on this list, but prune the recursion: whenever a partition lies entirely beyond the first s positions, ignore it. At the end you have the first s elements of that list in sorted order, i.e. exactly the nth through (n+s-1)th smallest elements of the original list, which is the page you wanted.
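As a sketch only (illustrative names, 0-based offsets, not taken verbatim from the answer above), steps #1 and #2 can even be folded into one routine: an ordinary Quicksort whose recursion simply skips every partition that lies completely outside the requested page.

#include <stdlib.h>

/* illustrative sketch, not from the answer */
static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Sort a[lo..hi] (inclusive), but only where it overlaps the page
   [off, off + len).  Afterwards a[off..off+len-1] holds the page, sorted. */
static void page_quicksort(int *a, int lo, int hi, int off, int len)
{
    while (lo < hi) {
        if (hi < off || lo >= off + len)          /* partition is entirely   */
            return;                               /* outside the page: skip  */
        swap_int(&a[lo + rand() % (hi - lo + 1)], &a[hi]);
        int pivot = a[hi], store = lo;            /* Lomuto partition        */
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot)
                swap_int(&a[i], &a[store++]);
        swap_int(&a[store], &a[hi]);
        page_quicksort(a, lo, store - 1, off, len);   /* left part           */
        lo = store + 1;                               /* iterate on the right */
    }
}

Calling page_quicksort(items, 0, count - 1, n - 1, s) (with n 1-based, as above) does the "find the nth element" work and the "sort only the page" work in a single pass, in expected O(count + s·log s) time.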
The term you want to research is partial sorting. There is likely to be an implementation of it in C or any sufficiently popular language.

Related

How to replace the elements of a range less than k with k?

How do I replace the elements in a range of an array that are greater than k with k, when the number of queries is high?
Each query is given in the form l r k, where [l...r] is the range of the array to update.
Since my first answer created a big thread of comments, I'm going to combine everything into a new answer.
We are going to use a segment tree as a helper data structure; it answers the question: what is the minimum on range [l, r]? Initially all segment tree nodes are filled with some "Infinity" value, which can be 201 in your problem (since all K are lower than 200, based on your comment).
Once we have read our input array (let's call it A), we process the queries:
For each query [L, R, K] we update our segment tree: try to set a new minimum K on the range [L, R]. That can be done in O(log N) using lazy propagation. Here is a good example: http://se7so.blogspot.com/2012/12/segment-trees-and-lazy-propagation.html
Now we build the final array: we iterate over each index and replace A[i] = min(A[i], minimum_on_range(i, i)). That takes N * log(N) steps.
The total complexity of this approach is O(M * log(N) + N * log(N)).
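For reference, here is a compact C sketch of this segment-tree idea (my own illustrative code, not from the linked post). Because every update is a range "take the minimum with K" and all point reads happen only after the updates, each node just needs to remember the smallest K ever assigned to it, and a point query is the minimum along the root-to-leaf path; this is a simplified stand-in for full lazy propagation that is enough for this particular problem.

#include <stdio.h>

/* illustrative sketch, not from the answer */
#define MAXN 100000
#define INF  201            /* all K are below 201 per the question */

static int tree[4 * MAXN];
static int n;               /* array size */

static void tree_init(int size)
{
    n = size;
    for (int i = 0; i < 4 * n; i++) tree[i] = INF;
}

/* apply "min with k" to every index in [l, r]; O(log n) */
static void tree_update(int node, int lo, int hi, int l, int r, int k)
{
    if (r < lo || hi < l) return;
    if (l <= lo && hi <= r) {              /* node fully covered: record k */
        if (k < tree[node]) tree[node] = k;
        return;
    }
    int mid = (lo + hi) / 2;
    tree_update(2 * node,     lo,      mid, l, r, k);
    tree_update(2 * node + 1, mid + 1, hi,  l, r, k);
}

/* smallest k applied to index i = minimum along the root-to-leaf path */
static int tree_query(int node, int lo, int hi, int i)
{
    int best = tree[node];
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (i <= mid) { node = 2 * node;     hi = mid; }
        else          { node = 2 * node + 1; lo = mid + 1; }
        if (tree[node] < best) best = tree[node];
    }
    return best;
}

int main(void)
{
    int a[8] = { 5, 190, 3, 42, 200, 7, 150, 60 };
    tree_init(8);
    tree_update(1, 0, n - 1, 2, 6, 100);   /* query "2 6 100" */
    tree_update(1, 0, n - 1, 0, 4, 50);    /* query "0 4 50"  */
    for (int i = 0; i < n; i++) {          /* build the final array */
        int m = tree_query(1, 0, n - 1, i);
        if (m < a[i]) a[i] = m;            /* A[i] = min(A[i], range minimum) */
        printf("%d ", a[i]);
    }
    printf("\n");
    return 0;
}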

find nth-smallest value across m sorted arrays using idea from 2 sorted arrays

May I ask whether this would be possible? The general approach would be similar to finding the n-th value in two sorted arrays: ignore the insignificant parts and focus on the rest by adjusting the value of n in the recursion.
The 2-sorted-arrays problem can be solved in O(log(|A|) + log(|B|)) time. Since this question is similar, I would like to ask whether there exists an algorithm for m sorted arrays running in O(log(|A1|) + log(|A2|) + ... + log(|Am|)) time,
or some variation close to that (because of the variable m, we might need some other scheme for handling the pivots drawn from those arrays),
or, if such an algorithm doesn't exist, why not?
I just can't find this algorithm by googling.
There is a simple randomized algorithm:
Select a pivot randomly from any of the m arrays. Let's call it x
For every array, do a binary search for x to find out how many values < x are in the array. Say we have r_i values < x in array i. Then x is greater than exactly r = sum(i = 1 to m) r_i elements of the union of all arrays.
If n <= r, we restrict each array i to the indices 0...(r_i - 1) and recurse. If n = r + 1, then x itself is the answer. Otherwise (n > r + 1) we restrict each array to the indices r_i...|A_i| - 1, replace n by n - r, and recurse.
repeat
The expected recursion depth is O(log(N)), with N being the total number of elements, by a proof similar to that of Quickselect, so the expected running time is something like O(m · log²(N)).
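A C sketch of this randomized algorithm (illustrative names such as nth_of_m_sorted, not from the answer; to stay robust when values repeat, it also counts elements <= x so it can return the pivot directly once its rank range covers n):

#include <stdlib.h>

/* illustrative sketch, not from the answer */
/* index of the first element >= x (lower) or > x (upper) in a[lo..hi) */
static int lower_bound(const int *a, int lo, int hi, int x)
{
    while (lo < hi) { int mid = (lo + hi) / 2; if (a[mid] < x) lo = mid + 1; else hi = mid; }
    return lo;
}
static int upper_bound(const int *a, int lo, int hi, int x)
{
    while (lo < hi) { int mid = (lo + hi) / 2; if (a[mid] <= x) lo = mid + 1; else hi = mid; }
    return lo;
}

/* n-th smallest (1-based) element of the union of m sorted arrays */
int nth_of_m_sorted(const int **arrays, const int *len, int m, int n)
{
    int *lo = calloc(m, sizeof *lo);          /* live window [lo[i], hi[i]) per array */
    int *hi = malloc(m * sizeof *hi);
    for (int i = 0; i < m; i++) hi[i] = len[i];

    for (;;) {
        long total = 0, pick, r = 0, s = 0;
        int x = 0;
        for (int i = 0; i < m; i++) total += hi[i] - lo[i];
        pick = rand() % total;                /* pick a random pivot x from the union */
        for (int i = 0; i < m; i++) {
            long w = hi[i] - lo[i];
            if (pick < w) { x = arrays[i][lo[i] + (int)pick]; break; }
            pick -= w;
        }
        for (int i = 0; i < m; i++) {         /* m binary searches around x */
            r += lower_bound(arrays[i], lo[i], hi[i], x) - lo[i];   /* # < x  */
            s += upper_bound(arrays[i], lo[i], hi[i], x) - lo[i];   /* # <= x */
        }
        if (n <= r) {                         /* answer lies among elements < x */
            for (int i = 0; i < m; i++) hi[i] = lower_bound(arrays[i], lo[i], hi[i], x);
        } else if (n <= s) {                  /* x itself is the n-th smallest */
            free(lo); free(hi);
            return x;
        } else {                              /* answer lies among elements > x */
            for (int i = 0; i < m; i++) lo[i] = upper_bound(arrays[i], lo[i], hi[i], x);
            n -= s;
        }
    }
}

Every round removes at least the pivot from the windows, so the loop terminates, and the expected number of rounds is logarithmic in the total number of elements, matching the analysis above.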
The paper "Generalized Selection and Ranking" by Frederickson and Johnson proposes selection and ranking algorithms for different scenarios, for example an O(m + c * log(k/c)) algorithm to select the k-th element from m equally sized sorted sequences, with c = min{m, k}.

How to locate in a huge list of numbers, two numbers where xi=xj?

I have the following question, and it screams at me for a solution with hashing:
Problem :
Given a huge list of numbers x_1, ..., x_n where x_i <= T, we'd like to know
whether there exist two indices i, j (i ≠ j) such that x_i == x_j.
Find an algorithm for the problem with O(n) expected running time.
My solution at the moment: we use hashing with chaining, with a hash function h(x).
First we build a new array, let's call it A, where each cell is a linked list; this is the destination array.
Now we run over all n numbers and map each element of x_1, ..., x_n to its place using the hash function. This takes O(n) expected time.
After that we run over A and look for collisions. If we find a cell where length(A[k]) > 1,
we return the x_i and x_j that were mapped to A[k]. The scan over A costs O(n) in the worst case, for example when the two colliding numbers (if they exist) are mapped to the last cell of A.
The same approach can be made roughly twice as fast on average; it is still O(n) on average, but with better constants.
No need to map all the elements into the hash and then go over it - a faster solution could be:
for each element e:
    if e is in the table:
        return e
    else:
        insert e into the table
Also note that if T < n, there must be a duplicate within the first T+1 elements, by the pigeonhole principle.
Also, for small T you can use a simple array of size T+1; no hash is needed (hash(x) = x). Filling it with zeros initially is cheap: O(T) with a plain loop, or even O(1) with the standard lazy array-initialization trick.
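A minimal C sketch of the "check before insert" idea for the easy case where values fit in 0..T, so a plain array plays the role of the hash table (find_duplicate is an illustrative name; a general solution would use a chained hash table, but the control flow is the same):

#include <stdio.h>
#include <stdlib.h>

/* illustrative sketch, not from the answer */
/* returns the first duplicated value in x[0..n-1] (values in 0..T), or -1 */
int find_duplicate(const int *x, int n, int T)
{
    char *seen = calloc(T + 1, 1);               /* zero-initialised "table" */
    int dup = -1;
    for (int i = 0; i < n; i++) {
        if (seen[x[i]]) { dup = x[i]; break; }   /* found x_i == x_j, j < i  */
        seen[x[i]] = 1;                          /* otherwise insert          */
    }
    free(seen);
    return dup;
}

int main(void)
{
    int x[] = { 4, 9, 2, 7, 9, 1 };
    printf("%d\n", find_duplicate(x, 6, 10));    /* prints 9 */
    return 0;
}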

Find the Element Occurring b times in an array of size n*k+b

Description
Given an array of size n*k+b where n elements occur k times each and one element occurs b times; in other words, there are n+1 distinct elements. Given that 0 < b < k, find the element occurring b times.
My Attempted solutions
The obvious solution is hashing, but it will not work if the numbers are very large. Complexity is O(n).
Using a map to store the frequency of each element and then traversing the map to find the element occurring b times. As maps are implemented as height-balanced trees, the complexity is O(n log n).
Both of my solutions were accepted, but the interviewer wanted a linear solution without hashing. The hint he gave was to make the height of the tree in which the frequencies are stored constant, but I have not been able to figure out the correct solution yet.
I want to know how to solve this problem in linear time without hashing?
EDIT:
Sample:
Input: n=2 b=2 k=3
Array: 2 2 2 3 3 3 1 1
Output: 1
I assume:
The elements of the array are comparable.
We know the values of n and k beforehand.
A solution O(n*k+b) is good enough.
Let the number occurring only b times be S. We are trying to find S in an array of size n*k+b.
Recursive step: find the median element M of the current array slice in linear time (e.g. with Quickselect / median of medians).
After this step you have an array slice where all elements smaller than M occur to the left of the first occurrence of M, all occurrences of M are next to each other, and all elements larger than M are to the right of all occurrences of M.
Look at the index of the leftmost M and use it to decide whether S < M or S >= M (see the clarification below). Recurse either on the left slice or on the right slice.
So you are doing a Quicksort, but descending into only one part of the division at any time. You will recurse O(log n) times, but each time with 1/2, 1/4, 1/8, ... of the original array, so the total time is still O(n).
Clarification: Let's say n=20 and k=10. Then there are 21 distinct elements in the array, 20 of which occur 10 times and the last occurs, say, 7 times. I find the median element, say it is 1111. If S < 1111 then the index of the leftmost occurrence of 1111 will not be a multiple of 10; if S >= 1111 then that index will be a multiple of 10.
Full example: n = 4. k = 3. Array = {1,2,3,4,5,1,2,3,4,5,1,2,3,5}
After the first recursive step, I find that the median element is 3 and the array looks something like {1,2,1,2,1,2,3,3,3,5,4,5,5,4}. There are 6 elements to the left of 3, and 6 is a multiple of k=3, so every element there must occur 3 times. Therefore S >= 3; recurse on the right side. And so on.
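A hedged C sketch of this partition idea (illustrative names such as find_b_times, not the answerer's code): instead of an exact linear-time median, it uses a random pivot with a three-way partition, giving expected linear time. The test is exactly the one above: the "smaller than pivot" block has a size divisible by k iff S is not hiding in it, and the pivot's own block has size b iff the pivot is S.

#include <stdio.h>
#include <stdlib.h>

/* illustrative sketch, not from the answer */
/* find the element occurring b (not k) times in a[0..len-1], with 0 < b < k */
int find_b_times(int *a, int len, int k)
{
    int lo = 0, hi = len;               /* current slice [lo, hi) */
    while (hi - lo > 0) {
        int pivot = a[lo + rand() % (hi - lo)];
        /* three-way (Dutch national flag) partition of the slice */
        int lt = lo, i = lo, gt = hi;
        while (i < gt) {
            if (a[i] < pivot)      { int t = a[i]; a[i] = a[lt]; a[lt] = t; lt++; i++; }
            else if (a[i] > pivot) { gt--; int t = a[i]; a[i] = a[gt]; a[gt] = t; }
            else i++;
        }
        int less = lt - lo, equal = gt - lt;
        if (equal % k != 0)             /* pivot's block has size b: pivot is S */
            return pivot;
        if (less % k != 0)              /* S hides among the smaller elements   */
            hi = lt;
        else                            /* S hides among the larger elements    */
            lo = gt;
    }
    return -1;                          /* not reached for valid input */
}

int main(void)
{
    int a[] = { 2, 2, 2, 3, 3, 3, 1, 1 };   /* n=2, k=3, b=2: answer is 1 */
    printf("%d\n", find_b_times(a, 8, 3));
    return 0;
}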
An idea using cyclic groups.
To guess i-th bit of answer, follow this procedure:
Count how many numbers in the array have the i-th bit set; store this count as cnt
If cnt % k is non-zero, then i-th bit of answer is set. Otherwise it is clear.
To guess whole number, repeat the above for every bit.
This solution is technically O((n*k+b) * log(max N)), where max N is the maximal value in the array, but because the number of bits is usually a constant, the solution is linear in the array size.
No hashing, memory usage is O(log k * log max N).
Example implementation:
from random import randint, shuffle
from functools import reduce

def generate_test_data(n, k, b):
    # n values repeated k times plus one value repeated b times, shuffled
    k_rep = [randint(0, 1000) for i in range(n)]
    b_rep = [randint(0, 1000)]
    numbers = k_rep * k + b_rep * b
    shuffle(numbers)
    print("k_rep:", k_rep)
    print("b_rep:", b_rep)
    return numbers

def solve(data, k):
    cnts = [0] * 10                     # 10 bits is enough for values <= 1000
    for number in data:
        bits = [number >> b & 1 for b in range(10)]
        cnts = [cnts[i] + bits[i] for i in range(10)]
    # bit i of the answer is set iff cnts[i] is not a multiple of k
    return reduce(lambda a, b: 2 * a + (b % k > 0), reversed(cnts), 0)

print("Answer:", solve(generate_test_data(10, 15, 13), 3))
In order to have a constant-height B-tree containing n distinct elements, with height h constant, you need z = n^(1/h) children per node: h = log_z(n), thus h = log(n)/log(z), thus log(z) = log(n)/h, thus z = e^(log(n)/h), thus z = n^(1/h).
Example: with n = 1,000,000 and h = 10, z = 3.98, i.e. z = 4.
The time to reach a node in that case is O(h·log(z)). Assuming h and z to be "constant" (since N = n·k, we have log(z) = log(n^(1/h)) = log((N/k)^(1/h)), which is constant if h is chosen properly based on k), you can then say that O(h·log(z)) = O(1)... This is a bit far-fetched, but maybe that was the kind of thing the interviewer wanted to hear?
UPDATE: this one uses hashing, so it's not a good answer :(
In Python this would be linear time (the set removes the duplicates):
result = (sum(set(arr))*k - sum(arr)) / (k - b)
If 'k' is even and 'b' is odd, then XOR will do. :)

Median of Lists

I was asked this question:
You are given two lists of integers, each of which is sorted in ascending order and each of which has length n. All integers in the two lists are different. You wish to find the n-th smallest element of the union of the two lists. (That is, if you concatenated the lists and sorted the resulting list in ascending order, the element which would be at the n-th position.)
My Solution:
Assume that lists are 0-indexed.
O(n) solution:
A straightforward solution is to observe that the arrays are already sorted, so we can merge them and stop after n steps. The first n-1 elements do not need to be copied into a new array, so this solution takes O(n) time and O(1) memory.
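A short C sketch of this merge idea (illustrative name nth_by_merge; lists are 0-indexed, each has length n, and n is also the rank we want, as in the question):

/* illustrative sketch of the O(n)/O(1) merge approach */
int nth_by_merge(const int *l1, const int *l2, int n)
{
    int i = 0, j = 0, value = 0;
    for (int step = 0; step < n; step++) {      /* take n merge steps */
        if (l1[i] < l2[j]) value = l1[i++];
        else               value = l2[j++];
    }
    return value;                               /* the n-th smallest */
}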
O(log² n) solution:
The O(log² n) solution alternates binary searches between the two lists. In short, it takes the middle element of the current search interval in the first list (l1[p1]) and searches for it in l2. Since the elements are unique, we will find at most 2 values closest to l1[p1]. Depending on their values relative to l1[p1-1] and l1[p1+1] and their indices p21 and p22, we either return the n-th element or recurse: if one of the (at most) 3 indices in l1 can be combined with one of the (at most) 2 indices in l2 so that l1[p1'] and l2[p2'] would be right next to each other in the sorted union of the two lists and p1' + p2' = n or p1' + p2' = n + 1, we return one of the 5 elements. If p1 + p2 > n, we recurse into the left half of the search interval in l1, otherwise we recurse into the right half. This way, for each of the O(log n) possible midpoints in l1 we do an O(log n) binary search in l2, so the running time is O(log² n).
O(log n) solution:
Assuming the lists l1 and l2 have constant access time to any of their elements, we
can use a modified version of binary search to get an O(log n) solution. The easiest approach is to search for an index p1 in just one of the lists and calculate the corresponding index p2 in the other list so that p1 + p2 = n at all times. (This assumes the lists are indexed from 1.)
First we check for the special case when all elements of one list are smaller than any element in the other list:
If l1[n] < l2[1], return l1[n].
If l2[n] < l1[1], return l2[n].
If we do not find the n-th smallest element after this step, call findNth(1,n) with the approximate pseudocode:
findNth(start, end):
    p1 = (start + end)/2
    p2 = n - p1
    if l1[p1] < l2[p2]:
        if l1[p1 + 1] > l2[p2]:
            return l2[p2]
        else:
            return findNth(p1+1, end)
    else:
        if l2[p2 + 1] > l1[p1]:
            return l1[p1]
        else:
            return findNth(start, p1-1)
Element l2[p2] is returned when l2[p2] is greater than exactly p1 + p2 - 1 = n - 1 elements (and therefore is the n-th smallest). l1[p1] is returned under the same but symmetric conditions. If l1[p1] < l2[p2] and l1[p1+1] < l2[p2], the rank of l2[p2] is greater than n, so we need to take more elements from l1 and fewer from l2; therefore we search for p1 in the upper half of the previous search interval. On the other hand, if l2[p2] < l1[p1] and l2[p2+1] < l1[p1], the rank of l1[p1] is greater than n, so the real p1 lies in the bottom half of the current search interval. Since we halve the size of the problem at each call to findNth and need only constant work per call, the recurrence is T(n) = T(n/2) + O(1), which has an O(log n)-time solution.
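For completeness, here is a C sketch of the logarithmic idea. Rather than transcribing the findNth pseudocode above literally, this is the common "discard about k/2 elements per step" variant of the same binary search (kth is an illustrative name, not from the question):

/* illustrative sketch: k-th smallest (1-based) of the union of two sorted arrays */
int kth(const int *a, int na, const int *b, int nb, int k)
{
    if (na == 0) return b[k - 1];
    if (nb == 0) return a[k - 1];
    if (k == 1)  return a[0] < b[0] ? a[0] : b[0];

    int i = k / 2 < na ? k / 2 : na;    /* candidate prefix of a to discard */
    int j = k / 2 < nb ? k / 2 : nb;    /* candidate prefix of b to discard */
    if (a[i - 1] < b[j - 1])
        return kth(a + i, na - i, b, nb, k - i);   /* a[0..i-1] can't contain the k-th */
    else
        return kth(a, na, b + j, nb - j, k - j);   /* b[0..j-1] can't contain the k-th */
}

For the question as stated, kth(l1, n, l2, n, n) returns the n-th smallest element of the union in O(log n) time.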
The interviewer kept asking me for different approaches to the above problem. I proposed the three approaches above. Are they correct? Is there any better solution for this question? This question gets asked a lot, so please share some good material on it.
Not sure if you took a look at this: http://www.leetcode.com/2011/01/find-k-th-smallest-element-in-union-of.html
It solves a more general version of the problem you are asking about. Logarithmic complexity is definitely possible.
I think this will be the best solution:
-> 1 2 3 4 5 6 7 8 9
-> 10 11 12 13 14 15 16 17 18
Take two pointers i and j, each pointing at the start of its array. Increment i if a[i] < b[j],
increment j if a[i] > b[j],
and do this n times; the last element taken is the n-th smallest.
This is a linear O(n) time, O(1) space solution.
