time complexity of randomized array insertion - arrays

So I had to insert N elements in random order into a size-N array, but I am not sure about the time complexity of the program
the program is basically:
for (i = 0; i < n; i++) {
    index = random(0, n);        // n is exclusive
    while (array[index] != null)
        index = random(0, n);
    array[index] = element[i];   // place the i-th element in the free slot
}
Here is my assumption: a normal insertion of N numbers of course costs exactly N, but how much extra do the collisions at random positions cost? For each insertion the collision probability grows like 0, 1/n, 2/n, ..., (n-1)/n, so I figured the expected number of attempts per insertion grows like 1, 2, 3, ..., n-1, i.e. O(n) each, giving a total time complexity of O(n^2). Is that the average cost? That seems really bad; am I right?
What happens if I do a linear search for a free slot instead of repeatedly generating random numbers? Its worst case is obviously O(n^2), but I don't know how to analyze its average case, which depends on the input distribution.

First consider the inner loop. When do we expect to have our first success (find an open position) when there are i values already in the array? For this we use the geometric distribution:
Pr(X = k) = (1-p)^{k-1} p
Where p is the probability of success for an attempt.
Here p is the probability that the array index is not already filled.
There are i filled positions so p = (1 - (i/n)) = ((n - i)/n).
From the wiki, the expectation for the geometric distribution is 1/p = 1 / ((n-i)/n) = n/(n-i).
Therefore, we should expect to make (n / (n - i)) attempts in the inner loop when there are i items in the array.
To fill the array, we insert a new value when the array has i=0..n-1 items in it. The amount of attempts we expect to make overall is the sum:
sum_{i=0,n-1} n/(n-i)
= n * sum_{i=0,n-1}(1/(n-i))
= n * (1/n + 1/(n-1) + ... + 1/1)
= n * (1/1 + ... + 1/(n-1) + 1/n)
= n * sum_{i=1,n}(1/i)
This is n times the nth harmonic number, which is approximately ln(n) + gamma, where gamma is a constant (the Euler-Mascheroni constant). So overall, the expected number of attempts is approximately n * (ln(n) + gamma), which is O(n log n). Remember that this is only the expectation; since the inner loop is random there is no true upper bound, and in principle it may never find an open spot.
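As a quick sanity check of that n * H_n estimate, here is a small simulation I added (the function names are mine, not from the question) that counts attempts empirically and compares against the prediction:
import random

def attempts_to_fill(n):
    """Fill an n-slot array at uniformly random positions, counting every attempt."""
    array = [None] * n
    attempts = 0
    for value in range(n):
        while True:
            attempts += 1
            index = random.randrange(n)   # random(0, n), n exclusive
            if array[index] is None:
                array[index] = value
                break
    return attempts

def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

n, trials = 1000, 50
average = sum(attempts_to_fill(n) for _ in range(trials)) / float(trials)
print("observed:", average, "predicted n * H_n:", n * harmonic(n))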

The expected number of failed attempts (retries) at step i, when i slots are already filled, is
sum_{t=0}^infinity t * (i/n)^t * (n-i)/n
= (n-i)/n * (i/n) * (1 - i/n)^{-2}
= i/(n-i)
Summing over i you get
sum_{i=0}^{n-1} i/(n-i)
>= sum_{i=n/2}^{n-1} i/(n-i)
>= (n/2) * sum_{i=n/2}^{n-1} 1/(n-i)
= (n/2) * sum_{x=1}^{n/2} 1/x
= (n/2) * log(n) - O(n)
And
sum_{i=0}^{n-1} i/(n-i)
<= n * sum_{x=1}^{n} 1/x
= n * log(n) + O(n)
So you get Theta(n*log(n)) as a tight asymptotic bound, which is not as bad as you feared.
About doing a linear search, I don't know how you would do it while keeping the array random. If you really want an efficient algorithm to shuffle your array, you should check out Fisher-Yates shuffle.
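For reference, here is a minimal Fisher-Yates sketch in Python (my own illustration, not code from the answer); it produces a uniformly random permutation in O(n) with no retries:
import random

def fisher_yates_shuffle(items):
    """Shuffle items in place; each of the n! permutations is equally likely."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)               # pick a position in items[0..i]
        items[i], items[j] = items[j], items[i]

values = list(range(10))
fisher_yates_shuffle(values)
print(values)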

Related

Quick select different pivot selection algorithms results

I wrote the following randomized quick-select algorithm that moves the smallest k elements of an array to the beginning of it in linear time (technically the worst case is O(n^2), but the probability of hitting it drops exponentially):
// This function moves the smallest k elements of the array to
// the beginning of it in time O(n).
void moveKSmallestValuesToTheLeft( double arr[] ,
                                   unsigned int n ,
                                   unsigned int k )
{
    int l = 0, r = n - 1; // Beginning and end indices of the current subarray
    while (0 < k && k < n && n > 10)
    {
        unsigned int partition_index, left_size, pivot;
        // Partition the data around a random pivot
        pivot = generatePivot(arr, l, n, k);           // explained later
        partition_index = partition(arr, l, r, pivot); // standard quick sort partition
        left_size = partition_index - l + 1;
        if (k < left_size)
        {
            // Continue with left subarray
            r = partition_index - 1;
            n = partition_index - l;
        }
        else
        {
            // Continue with right subarray
            l += left_size;
            n -= left_size;
            k -= left_size;
        }
    }
    if (n <= 10)
        insertionSort(arr + l, n);
}
And I tested 3 different methods for generating the pivot, all of them based on selecting 5 random candidates and returning one of them; for each method I ran the code 100,000 times. These were the methods (a rough sketch of all three follows the list):
Choose 5 random elements and return their median.
Choose 5 random elements, compute k/n, and check which of the 5 is closest to that ratio. I.e., if k/n <= 1/5 return the min, if k/n <= 2/5 return the second smallest value, if k/n <= 3/5 return the median, and so on.
Exactly the same as method 2, but giving more weight to the pivots closer to the median, based on the binomial coefficients: I calculated the binomial coefficients for n = 5 - 1 = 4 and got [1, 4, 6, 4, 1], normalized them, took their cumulative sum to get [0.0625, 0.3125, 0.6875, 0.9375, 1], and then did: if k/n <= 0.0625 return the min, if k/n <= 0.3125 return the second smallest value, if k/n <= 0.6875 return the median, and so on.
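Here is a rough Python sketch of the three pivot strategies as I understand them from the description above (the helper names and sampling details are my own reading, not the original generatePivot code):
import random
from bisect import bisect_left

def five_candidates(arr, l, n):
    # Pick 5 distinct random positions in arr[l:l+n] and return their sorted values.
    return sorted(arr[i] for i in random.sample(range(l, l + n), 5))

def pivot_median_of_5(arr, l, n, k):
    # Method 1: median of the 5 candidates.
    return five_candidates(arr, l, n)[2]

def pivot_by_rank_ratio(arr, l, n, k):
    # Method 2: candidate whose rank (out of 5) matches k/n:
    # k/n <= 1/5 -> min, k/n <= 2/5 -> 2nd smallest, and so on.
    candidates = five_candidates(arr, l, n)
    return candidates[bisect_left([0.2, 0.4, 0.6, 0.8, 1.0], float(k) / n)]

def pivot_binomial_weighted(arr, l, n, k):
    # Method 3: same idea, but the thresholds are the cumulative sums of the
    # normalized binomial coefficients [1, 4, 6, 4, 1] / 16.
    candidates = five_candidates(arr, l, n)
    return candidates[bisect_left([0.0625, 0.3125, 0.6875, 0.9375, 1.0], float(k) / n)]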
My intuition told me that method 2 would perform best, because it always chooses the pivot most likely to be closest to the k'th smallest element, and so it should decrease k or n the most at each iteration. Instead, every time I ran the code I got the following results (methods ranked fastest to slowest by average and worst-case times):
Average running time:
First place (fastest): Method 3
Second place: Method 2
Last place: Method 1
Worst case running time:
First place (fastest): Method 1
Second place: Method 3
Last place: Method 2
My question is: is there any mathematical way to explain these results, or at least some intuition for them? My intuition was completely wrong; method 2 didn't outperform either of the other two methods.
EDIT
So apparently the problem was that I only tested k = n/2, which is an edge case, so I got these weird results.

Optimal Algorithm for finding peak element in an array

So far I haven't found any algorithm that solves this task: "An element is
considered as a peak if and only if (A[i]>A[i+1])&&(A[i]>A[i-1]), not
taking into account edges of the array(1D)."
I know that the common approach for this problem is using "Divide & Conquer" but that's in case of taking into consideration the edges as "peaks".
The O(..) complexity I need to get for this exercise is O(log(n)).
By the image above it is clear to me why it is O(log(n)), but without the edges the complexity changes to O(n), because in the lower picture I would run the recursive function on each side of the middle element, which makes it O(n) (the worst case being when the peak is near an edge). In that case, why not use a simple scan like this:
public static int GetPeak(int[] A)
{
    if (A.length <= 2) // the peak definition doesn't apply
    {
        return -1;
    }
    else
    {
        int Element = Integer.MAX_VALUE; // the element determined to be a peak
        // First and last elements can't be peaks
        for (int i = 1; i < A.length - 1; i++)
        {
            if (A[i] > A[i + 1] && A[i] > A[i - 1])
            {
                Element = A[i];
                break;
            }
            else
            {
                Element = -1;
            }
        }
        return Element;
    }
}
The common algorithm is written here: http://courses.csail.mit.edu/6.006/spring11/lectures/lec02.pdf, but as I said before it doesn't apply for the terms of this exercise.
I need to return only one peak, or -1 if there is none.
Also, my apologies if the post is worded incorrectly due to the language barrier (I am not a native English speaker).
I think what you're looking for is a dynamic programming approach, utilizing divide-and-conquer. Essentially, you would have a default value for your peak which you would overwrite when you find one. If you check at the beginning of your method and only run operations if you haven't found a peak yet, then your O() notation would look something like O(pn), where p is the probability that any given element of your array is a peak; p depends on how your data is structured (or isn't).

For instance, if your array only has values between 1 and 5 and they're distributed uniformly, then p = 0.24 and you would expect the algorithm to run in O(0.24n). Note that this still appears to be equivalent to O(n). However, if you require that the values in your array are distinct, then the probability is:
p = 2 * sum( [ choose(x - 1, 2) for x in 3:n ] ) / (n * (n - 1) * (n - 2))
p = sum( [ (x - 1) * (x - 2) for x in 3:n ] ) / (n * (n - 1) * (n - 2))
p = (n * (n - 1) * (n - 2) / 3) / (n * (n - 1) * (n - 2))
p = 1/3
So with distinct values, each interior element is a peak with probability exactly 1/3 (equivalently: of the 3! orderings of three distinct adjacent values, exactly 2 place the largest in the middle).
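If you want to double-check that 1/3 figure numerically, a quick simulation (mine, not part of the original answer) over random permutations of distinct values gives the same constant:
import random

def peak_fraction(n, trials):
    # Average fraction of interior positions that are strict peaks.
    total = 0.0
    for _ in range(trials):
        a = list(range(n))
        random.shuffle(a)
        peaks = sum(1 for i in range(1, n - 1)
                    if a[i] > a[i - 1] and a[i] > a[i + 1])
        total += peaks / float(n - 2)
    return total / trials

print(peak_fraction(1000, 200))   # hovers around 0.333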
So, if any given interior element has a 1/3 chance of being a peak, then at the bottom level of your recursion you have a 1/3 probability of finding a peak, and the expected number of comparisons before you find one is about 3, i.e. constant. However, you still have to get to the bottom level of your recursion before you can do the comparisons, and that requires O(log(n)) time. So a divide-and-conquer approach should run in O(log(n)) time in the average case, with O(n log(n)) in the worst case.
If you cannot make any assumptions about your data (monotonicity of the number sequence, number of peaks), and if edges cannot count as peaks, then you cannot hope for a better average performance than O(n). Your data is randomly distributed, and any value can be a peak. You have to examine them one by one, and there is no correlation between the values.
Accepting edges as potential candidates for peaks changes everything: you know there will always be at least one peak, and a good enough strategy is to always move in the direction of increasing values until you start to go down or you reach an edge (this is the one in the document you provided). That strategy is O(log(n)) because you can use binary search to look for a local max.
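For completeness, here is a small Python sketch of that binary-search strategy when edges are allowed to count as peaks (my own rendering of the lecture-notes idea, not code from either answer); it returns an index i with A[i] >= its existing neighbours in O(log n):
def find_peak(a):
    # Positions outside the array are treated as minus infinity, so an edge
    # may qualify as a peak. Uses >= comparisons, as in the lecture notes.
    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < a[mid + 1]:
            lo = mid + 1      # some peak lies strictly to the right
        else:
            hi = mid          # a[mid] >= a[mid+1]: some peak lies at mid or to its left
    return lo

print(find_peak([1, 3, 2, 5, 4]))   # prints 3; a[3] = 5 is a peak (index 1 would also qualify)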

Given a set of numbers,find the pair which has the least LCM(Lowest Common Multiple)

I used this approach.
Found all possible nC2 pairs from the n numbers.
Then individually found their LCM by computing their GCD and dividing the product of the two numbers by their GCD.
Also maintained a variable containing the least LCM value computed so far, and finally output it.
But this naive approach seems inefficient when the number values are very large (~10^9), since the time complexity of the GCD depends on the magnitude of the numbers. It is also infeasible for very large values of N.
Is there any other better approach to this problem?
For a large number of digits, an efficient implementation of the Euclidean algorithm (https://en.wikipedia.org/wiki/Euclidean_algorithm#Algorithmic_efficiency) for finding the GCD is the best route I can think of. There is no fast, general algorithm for prime factorization so using that to reduce the problem won't help the run time. I'm not aware of any fast reductions that would help with this.
Addressing large N, I think this is what others have been getting at:
Sort the array
Start with the lowest values and calculate LCMs (using the Euclidean algorithm, for example, for the GCD part) with a short circuit: stop processing once the LCM of the remaining pairs cannot be less than the best found so far. Note that, for two numbers b < c in the sorted set, the lower bound of the LCM is (b * c) / b = c (this occurs when b divides c). See the working code below (the short_lcm version).
There are other optimizations that can be made to this (such as not writing it in python :) ) but this demonstrates the algorithm improvement:
import fractions

def lcm(a, b):
    return abs(a * b) / fractions.gcd(a, b)

def short_lcm(a):
    a = sorted(a)
    iterations = 0
    cur_lcm = lcm(a[0], a[1])
    first = a[0]
    second = a[1]
    for i in range(0, len(a)):
        # Best case for remaining pairs
        if i < len(a) - 1 and a[i + 1] >= cur_lcm: break
        for j in range(i + 1, len(a)):  # Starting at i + 1 avoids duplicates
            if a[j] >= cur_lcm: break   # Best case for remaining pairs
            iterations += 1
            test = lcm(a[i], a[j])
            if test < cur_lcm:
                cur_lcm = test
                first = a[i]
                second = a[j]
    if iterations < 1: iterations = 1
    print("Lowest LCM pair is (%d, %d): %d. Found in %d iterations" % (
        first, second, cur_lcm, iterations))

def long_lcm(a):
    iterations = 0
    cur_lcm = lcm(a[0], a[1])
    first = a[0]
    second = a[1]
    for i in range(0, len(a)):
        for j in range(i + 1, len(a)):  # Starting at i + 1 avoids duplicates
            iterations += 1
            test = lcm(a[i], a[j])
            if test < cur_lcm:
                cur_lcm = test
                first = a[i]
                second = a[j]
    print("Lowest LCM pair is (%d, %d): %d. Found in %d iterations" % (
        first, second, cur_lcm, iterations))

if __name__ == '__main__':
    from random import randint
    import time
    a = [randint(1, 1000) for r in xrange(100)]
    # Only print the list if it's relatively short
    if len(a) < 20: print a
    # Try all pairs
    start = time.clock()
    long_lcm(a)
    print "Slow version time: %f\n" % (time.clock() - start)
    start = time.clock()
    short_lcm(a)
    print "Fast version time: %f" % (time.clock() - start)
I don't think there is an efficient algorithm for this.
You can always use simple heuristics, which will definitely work for this problem.
On average, for most arrays, the best pair will be some a, b (a < b) where LCM(a, b) ~ O(b), i.e. most of a's factors are contained in b, and hence the LCM is approximately of order b.
Hence, on average, the LCM will not be very large; it will be similar in size to the elements of the array.
So the idea is to sort the array and try smaller a, b first, in increasing order, stopping once b >= lcm_so_far.
Thanks

Absolute distance from various points in O(n)

I am stuck on a question. Part of the question requires calculating the sum of absolute distances of a point from various other points:
|x - x1| + |x - x2| + |x - x3| + |x - x4| ....
I have to calculate this distance in O(n) for every point while iterating over the array. For example:
array = { 3,5,4,7,5}
sum of distance from previous points
dis[0] = 0;
dis[1] = |3-5| = 2
dis[2] = |3-4| + |5-4| = 2
dis[3] = |3-7| + |5-7| + |4-7| = 9
dis[4] = |3-5| + |5-5| + |4-5| + |7-5| = 5
Can anyone suggest an algorithm to do this?
An algorithm better than O(n^2) would be appreciated (not necessarily O(n)).
Code for O(n^2)
REP(i, n) {
    LL ans = 0;
    for (int j = 0; j < i; j++)
        ans = ans + abs(a[i] - a[j]);
    dis[i] = ans;
}
An O(n log n) algorithm is possible.
Assume we had a data structure for a list of integers which supported:
Insert(x)
SumGreater(x)
SumLesser(x)
NumGreater(x)
NumLesser(x)
Insert(x) inserts x into the list.
SumGreater(x) gives the sum of all elements in the list that are greater than x.
SumLesser(x) gives the sum of all elements in the list that are less than x.
NumGreater(x) gives the number of elements greater than x.
NumLesser(x) gives the number of elements less than x.
Using balanced binary trees, with cumulative sub-tree sums and sub-tree counts stored in the nodes, we can implement each operation in O(log n) time.
To use this structure for your question, walk the array left to right. When you encounter a new element x:
Query the already-inserted numbers for SumGreater(x) = G, SumLesser(x) = L, NumGreater(x) = n_G and NumLesser(x) = n_L.
The value for x is then (G - n_G*x) + (n_L*x - L).
Then you insert x and continue.
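The same bookkeeping can also be done with a Fenwick (binary indexed) tree over the coordinate-compressed values instead of a balanced BST. Here is a hedged Python sketch of that variant (my own code, not from the answer); it reproduces the dis[] values from the example above in O(n log n):
class Fenwick:
    # Fenwick (binary indexed) tree: point update and prefix sum, both O(log n).
    def __init__(self, size):
        self.tree = [0] * (size + 1)

    def add(self, i, delta):
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):
        # Sum over positions 0..i inclusive.
        s = 0
        i += 1
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

def prefix_abs_distances(a):
    # dis[i] = sum(|a[i] - a[j]| for j < i)
    ranks = {v: r for r, v in enumerate(sorted(set(a)))}
    counts, sums = Fenwick(len(ranks)), Fenwick(len(ranks))
    dis, total = [], 0
    for i, x in enumerate(a):
        r = ranks[x]
        n_le, s_le = counts.prefix(r), sums.prefix(r)   # previous values <= x
        n_gt, s_gt = i - n_le, total - s_le             # previous values > x
        dis.append((n_le * x - s_le) + (s_gt - n_gt * x))
        counts.add(r, 1)
        sums.add(r, x)
        total += x
    return dis

print(prefix_abs_distances([3, 5, 4, 7, 5]))   # [0, 2, 2, 9, 5]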
Is O(n) even possible? - If the size of your output is 1/2 * n^2, how can you populate it in O(n) time?

Find the median of the sum of the arrays

Two sorted arrays of length n are given and the question is to find, in O(n) time, the median of their sum array, which contains all the possible pairwise sums between every element of array A and every element of array B.
For instance: Let A[2,4,6] and B[1,3,5] be the two given arrays.
The sum array is [2+1,2+3,2+5,4+1,4+3,4+5,6+1,6+3,6+5]. Find the median of this array in O(n).
Solving the question in O(n^2) is pretty straight-forward but is there any O(n) solution to this problem?
Note: This is an interview question asked to one of my friends and the interviewer was quite sure that it can be solved in O(n) time.
The correct O(n) solution is quite complicated, and takes a significant amount of text, code and skill to explain and prove. More precisely, it takes 3 pages to do so convincingly, as can be seen in details here http://www.cse.yorku.ca/~andy/pubs/X+Y.pdf (found by simonzack in the comments).
It is basically a clever divide-and-conquer algorithm that, among other things, takes advantage of the fact that in a sorted n-by-n matrix, one can find in O(n) the number of elements that are smaller/greater than a given number k. It recursively breaks the matrix down into smaller submatrices (by taking only the odd rows and columns, resulting in a submatrix with n/2 columns and n/2 rows), which, combined with the step above, results in a complexity of O(n) + O(n/2) + O(n/4) + ... = O(2*n) = O(n). It is crazy!
I can't explain it better than the paper, which is why I'll explain a simpler, O(n logn) solution instead :).
O(n * logn) solution:
It's an interview! You can't get that O(n) solution in time. So hey, why not provide a solution that, although not optimal, shows you can do better than the other obvious O(n²) candidates?
I'll make use of the O(n) algorithm mentioned above, to find the amount of numbers that are smaller/greater than a given number k in a sorted n-by-n matrix. Keep in mind that we don't need an actual matrix! The Cartesian sum of two arrays of size n, as described by the OP, results in a sorted n-by-n matrix, which we can simulate by considering the elements of the array as follows:
a[3] = {1, 5, 9};
b[3] = {4, 6, 8};
//a + b:
{1+4, 1+6, 1+8,
5+4, 5+6, 5+8,
9+4, 9+6, 9+8}
Thus each row contains non-decreasing numbers, and so does each column. Now, pretend you're given a number k. We want to find in O(n) how many of the numbers in this matrix are smaller than k, and how many are greater. Clearly, if both values are less than (n²+1)/2, that means k is our median!
The algorithm is pretty simple:
int smaller_than_k(int k){
    int x = 0, j = n-1;
    for(int i = 0; i < n; ++i){
        while(j >= 0 && k <= a[i]+b[j]){
            --j;
        }
        x += j+1;
    }
    return x;
}
This basically counts how many elements fit the condition at each row. Since the rows and columns are already sorted as seen above, this will provide the correct result. And as both i and j iterate at most n times each, the algorithm is O(n) [Note that j does not get reset within the for loop]. The greater_than_k algorithm is similar.
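If it helps to see both counters in one place, here is a hedged Python transliteration (my own; the greater-than counter is the mirror image of the one above), together with a brute-force check:
def count_smaller(k, a, b):
    # Number of pairs (i, j) with a[i] + b[j] < k; a and b sorted, O(n) total.
    n, count, j = len(a), 0, len(b) - 1
    for i in range(n):
        while j >= 0 and a[i] + b[j] >= k:
            j -= 1                  # j never resets, so the whole loop is O(n)
        count += j + 1
    return count

def count_greater(k, a, b):
    # Number of pairs (i, j) with a[i] + b[j] > k; mirror of count_smaller.
    n, count, j = len(a), 0, len(b)
    for i in range(n):
        while j > 0 and a[i] + b[j - 1] > k:
            j -= 1
        count += len(b) - j
    return count

a, b = [1, 5, 9], [4, 6, 8]
sums = sorted(x + y for x in a for y in b)
for k in range(min(sums) - 1, max(sums) + 2):
    assert count_smaller(k, a, b) == sum(s < k for s in sums)
    assert count_greater(k, a, b) == sum(s > k for s in sums)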
Now, how do we choose k? That is the logn part. Binary Search! As has been mentioned in other answers/comments, the median must be a value contained within this array:
candidates[n] = {a[0]+b[n-1], a[1]+b[n-2],... a[n-1]+b[0]};.
Simply sort this array [also O(n*logn)], and run the binary search on it. Since the array is now in non-decreasing order, it is straightforward to notice that the number of sums smaller than each candidate[i] is also non-decreasing (a monotonic function), which makes it suitable for binary search. The largest candidate k = candidate[i] for which smaller_than_k(k) returns a value smaller than (n²+1)/2 is the answer, and it is obtained in log(n) iterations:
int b_search(){
    int lo = 0, hi = n, mid, n2 = (n*n + 1)/2;
    while(hi - lo > 1){
        mid = (hi + lo)/2;
        if(smaller_than_k(candidate[mid]) < n2)
            lo = mid;
        else
            hi = mid;
    }
    return candidate[lo]; // the median
}
Let's say the arrays are A = {A[1] ... A[n]}, and B = {B[1] ... B[n]}, and the pairwise sum array is C = {A[i] + B[j], where 1 <= i <= n, 1 <= j <= n} which has n^2 elements and we need to find its median.
Median of C must be an element of the array D = {A[1] + B[n], A[2] + B[n - 1], ... A[n] + B[1]}: if you fix A[i] and consider all the sums A[i] + B[j], you would see that only A[i] + B[n + 1 - i] (which is one of D) could be the median. That is, it may not be the median, but if it is not, then none of the other A[i] + B[j] are the median either.
This can be proved by considering all B[j] and counting the number of values that are lower and the number of values that are greater than A[i] + B[j] (we can do this quite accurately because the two arrays are sorted -- the calculation is a bit messy though). You'd see that for A[i] + B[n + 1 - i] these two counts are the most "balanced".
The problem then reduces to finding median of D, which has only n elements. An algorithm such as Hoare's will work.
UPDATE: this answer is wrong. The real conclusion here is that the median is one of D's elements, but D's median is not the same as C's median.
Doesn't this work?:
You can compute the rank of a number in linear time as long as A and B are sorted. The technique you use for computing the rank can also be used to find all things in A+B that are between some lower bound and some upper bound in time linear the size of the output plus |A|+|B|.
Randomly sample n things from A+B. Take the median, say foo. Compute the rank of foo. With constant probability, foo's rank is within n of the median's rank. Keep doing this (an expected constant number of times) until you have lower and upper bounds on the median that are within 2n of each other. (This whole process takes expected linear time, but it's obviously slow.)
All you have to do now is enumerate everything between the bounds and do a linear-time selection on a linear-sized list.
(Unrelatedly, I wouldn't excuse the interviewer for asking such an obviously crappy interview question. Stuff like this in no way indicates your ability to code.)
EDIT: You can compute the rank of a number x by doing something like this:
Set i = j = 0.
While j < |B| and A[i] + B[j] <= x, j++.
j--.   (j is now the last index with A[0] + B[j] <= x, or -1 if there is none.)
While i < |A| {
    While j >= 0 and A[i] + B[j] > x, j--.
    If j < 0, break.
    rank += j + 1.
    i++.
}
FURTHER EDIT: Actually, the above trick only narrows down the candidate space to about n log(n) members of A+B. Then you have a general selection problem within a universe of size n log(n); you can do basically the same trick one more time and find a range of size proportional to sqrt(n) log(n) where you do selection.
Here's why: If you sample k things from an n-set and take the median, then the sample median's order is between the (1/2 - sqrt(log(n) / k))th and the (1/2 + sqrt(log(n) / k))th elements with at least constant probability. When n = |A+B|, we'll want to take k = sqrt(n) and we get a range of about sqrt(n log n) elements --- that's about |A| log |A|. But then you do it again and you get a range on the order of sqrt(n) polylog(n).
You should use a selection algorithm to find the median of an unsorted list in O(n). Look at this: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

Resources