Cannot understand CLRS Problem 4-2 case 2 - arrays

The question:
4-2 Parameter-passing costs
Throughout this book, we assume that parameter passing during procedure calls takes constant time, even if an N-element array is being passed. This assumption is valid in most systems because a pointer to the array is passed, not the array itself.
This problem examines the implications of three parameter-passing strategies:
1. An array is passed by pointer. Time Theta(1).
2. An array is passed by copying. Time Theta(N), where N is the size of the array.
3. An array is passed by copying only the subrange that might be accessed by the called procedure. Time Theta(q-p+1) if the subarray A[p..q] is passed.
a. Consider the recursive binary search algorithm for finding a number in a sorted array (see Exercise 2.3-5 ). Give recurrences for the worst-case running times of binary search when arrays are passed using each of the three methods above, and give good upper bounds on the solutions of the recurrences. Let N be the size of the original problem and n be the size of a subproblem.
b. Redo part (a) for the MERGE-SORT algorithm from Section 2.3.1.
I have trouble understanding how to solve the recurrence of case 2 where arrays are passed by copying for both algorithms.
Take the binary search algorithm under case 2, for example: the recurrence most answers give is T(n) = T(n/2) + Theta(N). I have no trouble with that.
Here is an answer I found online that looks correct:
T(n) = T(n/2) + cN
     = T(n/4) + 2cN
     = T(n/8) + 3cN
     = ...
     = Sigma(i = 0 to lg n - 1) (2^i cN / 2^i)
     = cN lg n
     = Theta(n lg n)
I have trouble understanding how the second-to-last line turns into the last line. Why not write it as Theta(N lg n)? And how can N become n inside the Theta notation?
N and n feel somehow connected, but I have trouble understanding their exact relationship and how it is used in the solution.

It seems that N represents the full array length and n is the size of the current chunk (the current subproblem).
But the formulas really only use the initial value: the recursion starts from the full length, so n = N at the top level, and you can read T(n/4) as T(N/4), and so on; n and N coincide everywhere in the expansion.
In that case you can simply replace n with N: cN lg n evaluated at the original call (n = N) is cN lg N, so Theta(n lg n) and Theta(N lg N) describe the same bound.

My answer will not be very theoretical, but maybe this more "empirical" approach will help you figure it out. Also check the Master Theorem (analysis of algorithms) for an easier way to analyze the complexity of recursive algorithms.
Let's solve the binary search first:
By pointer
O(logN) - it acts like an iterative binary search: there will be logN calls, each with O(1) parameter-passing cost
Copying the whole array
O(NlogN) - since for each of the logN calls of the function we copy N elements
Copying only the accessed subarray
O(N) - this one is not that obvious, but it can easily be seen that the copied subarrays will have lengths N, N/2, N/4, N/8, ..., and summing all these terms gives at most 2*N (a short sketch below illustrates the sum)
Now for the merge sort algorithm:
O(NlogN) - the same reasoning as for a3: the subarrays processed at each level have lengths N, (N/2, N/2), (N/4, N/4, N/4, N/4), ..., so each of the logN levels does Theta(N) work
O(N^2) - we make about 2*N calls to the sorting function, and each one costs O(N) just for copying the whole array
O(NlogN) - we copy only the subarrays that we actually iterate over, so the cost is the same as in b1
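To see where that 2*N comes from in the subrange-copying case, here is a small Python sketch (my own code) that copies the accessed half at each recursive call and counts how many elements get copied in total:
def binary_search_copying(sub, key):
    # Recursive binary search that receives a copy of the accessed subrange,
    # mimicking passing strategy 3. Returns (found, elements_copied).
    copied = len(sub)                       # cost of receiving this copy
    if not sub:
        return False, copied
    mid = len(sub) // 2
    if sub[mid] == key:
        return True, copied
    half = sub[:mid] if key < sub[mid] else sub[mid + 1:]   # copy one half
    found, child_copied = binary_search_copying(half, key)
    return found, copied + child_copied

N = 1 << 16
found, total_copied = binary_search_copying(list(range(N)), -1)  # worst case: key absent
print(total_copied, 2 * N)   # total copied is N + N/2 + N/4 + ... < 2*N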

Related

What would be the time complexity of this algorithm?

I was wondering what the time complexity of this piece of code would be:
last = 0
ans = 0
array = [1, 2, 3, 3, 3, 4, 5, 6]
for number in array:
    if number != last:
        ans += 1
    last = number
return ans
I'm thinking O(n^2), as we look at all the array elements twice: once while executing the for loop and again when comparing the two subsequent values. But I am not sure if my guess is correct.
While processing each array element, you make just one comparison, based on which you update ans and last. The complexity of the algorithm is therefore O(n), not O(n^2).
The answer is actually O(1) for this case, and I will explain why after explaining why a similar algorithm would be O(n) and not O(n^2).
Take a look at the following example:
def do_something(array):
    last = 0
    ans = 0
    for number in array:
        if number != last:
            ans += 1
        last = number
    return ans
We go through each item in the array once, and do two operations with it.
The usual rule for time complexity is that you take the largest contributing term and drop the constant factor.
If we actually wanted to calculate the exact number of operations, we might try something like this:
for number in array:
    if number != last:   # n times
        ans += 1
    last = number        # n times
return ans               # 1 time
# total number of instructions = 2 * n + 1
Now, Python is a high-level language, so some of these operations are actually multiple operations put together, which means that instruction count is not exact. Instead, when discussing complexity we just take the largest contributing term (2 * n) and remove the coefficient to get n. Big-O is used when discussing the worst case, so we call this O(n).
I think you're confused because the algorithm you provided looks at two numbers at a time. The distinction you need to understand is that your code only looks at two numbers at a time, once for each item in the array. It does not look at every possible pair of numbers in the array. (Even if your code looked at half of all possible pairs, it would still be O(n^2), because the 1/2 factor would be dropped.)
For contrast, here is an example of an O(n^2) algorithm that does look at every pair:
for n1 in array:
    for n2 in array:
        print(n1 + n2)
In this example, we are looking at every pair of numbers. How many pairs are there? There are n^2 pairs. Contrast this with your question: we look at each number individually and compare it with last. How many (number, last) pairs are there? At most 2 * n, which we call O(n).
I hope this clears up why your code is O(n) and not O(n^2). However, as I said at the beginning of my answer, this particular snippet is actually O(1). That is because the length of the array is fixed at 8 rather than being some arbitrary length n. Every time you execute this code it will take the same amount of time; nothing varies, so there is no n. In my example n was the length of the array, but no such length parameter appears in your snippet.

Finding the Average case complexity of an Algorithm

I have an algorithm for Sequential search of an unsorted array:
SequentialSearch(A[0..n-1], K)
    i = 0
    while i < n and A[i] != K do
        i = i + 1
    if i < n then return i
    else return -1
where we have an input array A[0..n-1] and a search key K.
I know that the worst case is n, because we would have to search the entire array, hence n items: O(n).
I know that the best case is 1, since that would mean the first item we check is the one we want (or the array has all the same items); either way it's O(1).
But I have no idea how to calculate the average case. The answer my textbook gives is:
(p/n)[1 + 2 + ... + i + ... + n] + n(1 - p)
Is there a general formula I can follow when I see an algorithm like this one, so I can calculate it?
(p/n)[1 + 2 + ... + i + ... + n] + n(1 - p)
Here p is the probability that the search key is found in the array at all. Since the n positions are equally likely, p/n is the probability of finding the key at any particular index. We are essentially taking a weighted average: the terms 1, 2, ..., n weigh in 1 comparison, 2 comparisons, and so on up to n comparisons. Because we have to account for all inputs, the second part n(1 - p) covers the inputs whose key does not appear in the array (probability 1 - p), in which case we search through all n elements.
You'd need to consider the input cases (something like equivalence classes of inputs), which depends on the context of the algorithm. If none of those things are known, then assuming the input is an array of random integers, the average case would probably be O(n). This is because, roughly, you have no way of proving to any useful extent how often your query will be found in an array of N integer values in the range of about -32k to 32k.
More formally, let X be a discrete random variable denoting the number of elements of the array A that need to be scanned. There are n elements, and since all positions are equally likely for randomly generated inputs, X ~ Uniform(1, n) with X = 1, ..., n, given that the search key is found in the array (probability p); otherwise all the elements need to be scanned, so X = n (probability 1 - p).
Hence P(X = x) = (p/n) * I{x < n} + (p/n + (1 - p)) * I{x = n} for x = 1, ..., n, where I{x = n} is the indicator function, with value 1 iff x = n and 0 otherwise.
The average time complexity of the algorithm is the expected time taken to execute it on an arbitrary input. By definition, E[X] = Sigma(x = 1 to n) x * P(X = x) = (p/n)(1 + 2 + ... + n) + n(1 - p) = p(n + 1)/2 + n(1 - p), which is exactly the textbook expression.
(The original answer included a figure plotting how the expected search time changes with n and p.)
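If it helps to see the formula in action, here is a small simulation sketch (my own code, interpreting p as the probability that the key is present, with a uniformly random position when it is); the empirical average should track p(n + 1)/2 + n(1 - p):
import random

def comparisons(A, K):
    # Number of elements the sequential search above examines before stopping.
    for i, x in enumerate(A):
        if x == K:
            return i + 1
    return len(A)

def average_comparisons(n, p, trials=200_000):
    A = list(range(n))                      # n distinct keys 0 .. n-1
    total = 0
    for _ in range(trials):
        # With probability p the key is present (uniform position),
        # otherwise it is absent and all n elements get scanned.
        K = random.randrange(n) if random.random() < p else n
        total += comparisons(A, K)
    return total / trials

n, p = 20, 0.7
print(average_comparisons(n, p))            # empirical average
print(p * (n + 1) / 2 + n * (1 - p))        # textbook formula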

Algorithm for finding if there's a "common number"

Given an array of size n, we need to write an algorithm that checks whether some number appears at least n/loglog(n) times.
I understand that there is a way to do it in O(n*logloglog(n)) that goes something like this:
1. Find the median using a selection algorithm and count how many times it appears. If it appears more than n/loglog(n) times, return true. This takes O(n).
2. Partition the array around the median. This takes O(n).
3. Apply the algorithm recursively to both sides of the partition (two arrays of size n/2).
4. If we reach a subarray of size less than n/loglog(n), stop and return false.
Questions:
1. Is this algorithm correct?
2. The recurrence is T(n) = 2T(n/2) + O(n), and the base case is T(n/loglog(n)) = O(1). Now, the depth of the recursion tree is O(logloglog(n)), and since every call is O(n), the time complexity is O(n*logloglog(n)). Is that correct?
The suggested solution works, and the complexity is indeed O(n*logloglog(n)).
Let's say that "pass i" is the running of all recursive calls of depth i. Note that each pass requires O(n) time: while each individual call does much less than O(n) work, there are several calls per pass, and overall each element is processed once in every pass.
Now, we need to find the number of passes. This is done by solving the equation:
n/log(log(n)) = n / 2^x
<->
n/log(log(n)) * 2^x = n
The idea is that each call halves the array until you get down to the predefined size of n/log(log(n)).
Solving this for x gives x = log(log(log(n))) (as you can check in Wolfram Alpha), and thus the complexity is indeed O(n*log(log(log(n)))).
As for correctness: if an element repeats as often as required, it must lie in some subarray of size at least that number of repeats, and as the subarrays keep shrinking you will at some point reach one whose size is between #repeats and 2*#repeats. At that point the repeated element makes up more than half of the subarray, so you are going to find it as the median and discover that it is indeed a "frequent item".
Another approach, in O(n*log(log(n))) time but with large constants, is suggested by Karp-Papadimitriou-Shenker, and is based on filling a table of "candidates" while processing the array.
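For concreteness, here is a rough Python sketch of the recursive scheme described in the question (my own code, not from either post). It uses a randomized quickselect for the median, which is only expected-linear; for the worst-case bound in the analysis above you would substitute a deterministic median-of-medians selection:
import math
import random

def quickselect(a, k):
    # Expected-linear selection of the k-th smallest element (0-based).
    while True:
        pivot = random.choice(a)
        lesser = [x for x in a if x < pivot]
        equal_count = sum(1 for x in a if x == pivot)
        if k < len(lesser):
            a = lesser
        elif k < len(lesser) + equal_count:
            return pivot
        else:
            k -= len(lesser) + equal_count
            a = [x for x in a if x > pivot]

def has_frequent_item(arr, threshold=None):
    # True if some value appears at least n/loglog(n) times in the original array.
    if threshold is None:
        n = len(arr)
        threshold = 1 if n < 5 else n / math.log2(math.log2(n))
    if len(arr) < threshold:
        return False                                  # base case: subarray too small
    median = quickselect(arr, len(arr) // 2)
    if arr.count(median) >= threshold:                # O(n) counting pass
        return True
    left = [x for x in arr if x < median]             # partition around the median, O(n)
    right = [x for x in arr if x > median]
    return has_frequent_item(left, threshold) or has_frequent_item(right, threshold)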

Find the minimum number of elements required so that their sum equals or exceeds S

I know this can be done by sorting the array and taking the larger numbers until the required condition is met. That would take at least nlog(n) sorting time.
Is there any improvement over nlog(n)?
We can assume all numbers are positive.
Here is an algorithm that is O(n + size(smallest subset) * log(n)). If the smallest subset is much smaller than the array, this will be O(n).
Read http://en.wikipedia.org/wiki/Heap_%28data_structure%29 if my description of the algorithm is unclear (it is light on details, but the details are all there).
Turn the array into a heap arranged such that the biggest element is available in time O(n).
Repeatedly extract the biggest element from the heap until their sum is large enough. This takes O(size(smallest subset) * log(n)).
This is almost certainly the answer they were hoping for, though not getting it shouldn't be a deal breaker.
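A minimal Python sketch of this heap-based approach (heapq is a min-heap, so values are negated to pop the maximum; the function name is mine):
import heapq

def min_elements_to_reach(arr, S):
    heap = [-x for x in arr]               # negate so the min-heap behaves as a max-heap
    heapq.heapify(heap)                    # O(n)
    total, count = 0, 0
    while heap and total < S:
        total += -heapq.heappop(heap)      # extract the current maximum, O(log n)
        count += 1
    return count if total >= S else None   # None: even the whole array falls short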
Edit: Here is another variant that is often faster, but can be slower.
Walk through elements, until the sum of the first few exceeds S. Store current_sum.
Copy those elements into an array.
Heapify that array such that the minimum is easy to find, remember the minimum.
For each remaining element in the main array:
    if min(in our heap) < element:
        insert element into heap
        increase current_sum by element
        while S + min(in our heap) < current_sum:
            current_sum -= min(in our heap)
            remove min from heap
If we get to reject most of the array without manipulating our heap, this can be up to twice as fast as the previous solution. But it is also possible to be slower, such as when the last element in the array happens to be bigger than S.
Assuming the numbers are integers, you can improve upon the usual n lg(n) complexity of sorting because in this case we have the extra information that the values are between 0 and S (for our purposes, integers larger than S are the same as S).
Because the range of values is finite, you can use a non-comparative sorting algorithm such as Pigeonhole Sort or Radix Sort to go below n lg(n).
Note that these methods are dependent on some function of S, so if S gets large enough (and n stays small enough) you may be better off reverting to a comparative sort.
Here is an O(n) expected time solution to the problem. It's somewhat like Moron's idea but we don't throw out the work that our selection algorithm did in each step, and we start trying from an item potentially in the middle rather than using the repeated doubling approach.
Alternatively, it's really just quickselect with a little additional bookkeeping for the remaining sum.
First, it's clear that if you had the elements in sorted order, you could just pick the largest items first until you exceed the desired sum. Our solution is going to be like that, except we'll try as hard as we can to not to discover ordering information, because sorting is slow.
You want to be able to determine whether a given value is the cut-off: if including that value and everything greater than it meets or exceeds S, but removing it drops us below S, then we are golden.
Here is the pseudo code. I didn't test it for edge cases, but it gets the idea across.
import random

def Solve(arr, s):
    # We could get rid of the worst-case O(n^2) behavior (which basically never
    # happens) by selecting the median here deterministically, but in practice
    # the constant factor on the algorithm would be much worse.
    p = random.choice(arr)
    rest = list(arr)
    rest.remove(p)                          # p is in neither left_arr nor right_arr
    left_arr = [x for x in rest if x < p]
    right_arr = [x for x in rest if x >= p]
    right_sum = sum(right_arr)
    if right_sum + p >= s:
        if right_sum < s:
            # solved it, p forms the cut off
            return len(right_arr) + 1
        # took too much, at least we eliminated left_arr and p
        return Solve(right_arr, s)
    else:
        # didn't take enough yet; include right_arr and p, and recurse on left_arr
        return len(right_arr) + 1 + Solve(left_arr, s - right_sum - p)
One improvement (asymptotically) over Theta(nlogn) you can do is to get an O(n log K) time algorithm, where K is the required minimum number of elements.
Thus if K is constant, or say log n, this is better (asymptotically) than sorting. Of course if K is n^epsilon, then this is not better than Theta(n logn).
The way to do this is to use selection algorithms, which can tell you the ith largest element in O(n) time.
Now search for K by doubling: start with i = 1 (the largest element) and double i at each turn; once the sum of the i largest reaches S, binary search between i/2 and i for the exact K.
You find the ith largest, and find the sum of the i largest elements and check if it is greater than S or not.
This way, you would run O(log K) runs of the selection algorithm (which is O(n)) for a total running time of O(n log K).
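A rough sketch of that doubling idea (my own code; heapq.nlargest stands in for a true linear-time selection, so as written each probe costs more than O(n), but the control flow is the same):
import heapq

def top_sum(arr, i):
    # Sum of the i largest elements; a real O(n) selection would replace nlargest.
    return sum(heapq.nlargest(i, arr))

def min_count(arr, S):
    if top_sum(arr, len(arr)) < S:
        return None                        # even all elements together fall short
    i = 1
    while top_sum(arr, i) < S:             # doubling phase: O(log K) probes
        i = min(2 * i, len(arr))
    lo, hi = i // 2 + 1, i                 # answer lies in (i/2, i]
    while lo < hi:                         # binary search phase: O(log K) probes
        mid = (lo + hi) // 2
        if top_sum(arr, mid) >= S:
            hi = mid
        else:
            lo = mid + 1
    return lo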
Scan once: if you find some number >= S, you are done (one element suffices).
Pigeonhole-sort the remaining numbers (all < S).
Sum elements from highest to lowest in the sorted order until you meet or exceed S.

Find duplicate entry in array of integers

As a homework question, the following task had been given:
You are given an array with integers between 1 and 1,000,000. One integer is in the array twice. How can you determine which one? Can you think of a way to do it using little extra memory.
My solutions so far:
Solution 1
Have a hash table
Iterate through array and store its elements in hash table
As soon as you find an element which is already in the hash table, it is the dup element
Pros
It runs in O(n) time and with only 1 pass
Cons
It uses O(n) extra memory
Solution 2
Sort the array using merge sort (O(nlogn) time)
Parse again and if you see an element twice, you've got the dup.
Pros
It doesn't use extra memory
Cons
Running time is greater than O(n)
Can you guys think of any better solution?
The question is a little ambiguous: when the request is "which one," does it mean return the value that is duplicated, or the position in the sequence of the duplicated one? If the former, any of the following three solutions will work; if the latter, the first is the only one that will help.
Solution #1: assumes array is immutable
Build a bitmap; set the nth bit as you iterate through the array. If the bit is already set, you've found a duplicate. It runs in linear time and will work for any size array.
The bitmap would be created with as many bits as there are possible values in the array. As you iterate through the array, you check the nth bit in the array. If it is set, you've found your duplicate. If it isn't, then set it. (Logic for doing this can be seen in the pseudo-code in this Wikipedia entry on Bit arrays or use the System.Collections.BitArray class.)
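A small Python sketch of this bitmap idea, using a bytearray in place of a BitArray (about 125 KB for values up to 1,000,000); the names are mine:
def find_duplicate_bitmap(arr, max_value=1_000_000):
    bits = bytearray(max_value // 8 + 1)   # one bit per possible value
    for x in arr:
        byte, bit = divmod(x, 8)
        if bits[byte] & (1 << bit):
            return x                       # bit already set: x is the duplicate
        bits[byte] |= 1 << bit
    return None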
Solution #2: assumes array is mutable
Sort the array, and then do a linear search until the current value equals the previous value. Uses the least memory of all. Bonus points for altering the sort algorithm to detect the duplicate during a comparison operation and terminating early.
Solution #3: (assumes array length = 1,000,001)
Sum all of the integers in the array.
From that, subtract the sum of the integers 1 through 1,000,000 inclusive.
What's left will be your duplicated value.
This takes almost no extra memory and can be done in one pass if you calculate both sums at the same time.
The disadvantage is that you need to do the entire loop to find the answer.
The advantages are simplicity, and a high probability it will in fact run faster than the other solutions.
Assuming all the numbers from 1 to 1,000,000 are in the array, the sum of all numbers from 1 to 1,000,000 is (1,000,000)*(1,000,000 + 1)/2 = 500,000 * 1,000,001 = 500,000,500,000.
So just add up all the numbers in the array, subtract 500,000,500,000, and you'll be left with the number that occurred twice.
O(n) time, and O(1) memory.
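As a sketch (assuming the array really does hold 1..1,000,000 once each plus one extra copy, so it has 1,000,001 entries):
def find_duplicate_by_sum(arr):
    n = len(arr) - 1                 # 1,000,000
    expected = n * (n + 1) // 2      # sum of 1..n = 500,000,500,000
    return sum(arr) - expected       # whatever is left over is the duplicate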
If the assumption isn't true, you could try using a Bloom Filter - they can be stored much more compactly than a hash table (since they only store fact of presence), but they do run the risk of false positives. This risk can be bounded though, by our choice of how much memory to spend on the bloom filter.
We can then use the bloom filter to detect potential duplicates in O(n) time and check each candidate in O(n) time.
This python code is a modification of QuickSort:
def findDuplicate(arr):
    orig_len = len(arr)
    if orig_len <= 1:
        return None
    pivot = arr.pop(0)
    greater = [i for i in arr if i > pivot]
    lesser = [i for i in arr if i < pivot]
    if len(greater) + len(lesser) != orig_len - 1:
        return pivot
    else:
        return findDuplicate(lesser) or findDuplicate(greater)
It finds a duplicate in O(n log n), I think. It uses extra memory on the stack, but it can be rewritten to use only one copy of the original data, I believe:
def findDuplicate(arr):
    orig_len = len(arr)
    if orig_len <= 1:
        return None
    pivot = arr.pop(0)
    greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
    lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
    if len(arr):
        return pivot
    else:
        return findDuplicate(lesser) or findDuplicate(greater)
The list comprehensions that produce greater and lesser destroy the original with calls to pop(). If arr is not empty after removing greater and lesser from it, then there must be a duplicate and it must be pivot.
The code suffers from the usual stack overflow problems on sorted data, so either a random pivot or an iterative solution which queues the data is necessary:
def findDuplicate(full):
    import copy
    q = [full]
    while len(q):
        arr = copy.copy(q.pop(0))
        orig_len = len(arr)
        if orig_len > 1:
            pivot = arr.pop(0)
            greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
            lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
            if len(arr):
                return pivot
            else:
                q.append(greater)
                q.append(lesser)
    return None
However, now the code needs to take a deep copy of the data at the top of the loop, changing the memory requirements.
So much for computer science. The naive algorithm clobbers my code in Python, probably because of Python's sorting algorithm:
def findDuplicate(arr):
    arr = sorted(arr)
    prev = arr.pop(0)
    for element in arr:
        if element == prev:
            return prev
        else:
            prev = element
    return None
Rather than sorting the array and then checking, I would suggest writing an implementation of a comparison sort that exits as soon as the dup is found. That gives no extra memory requirement (depending on the algorithm you choose, obviously), and O(nlogn) becomes only the worst-case time rather than the time taken in the best (and average, depending...) case as well.
E.g. An implementation of in-place merge sort.
http://en.wikipedia.org/wiki/Merge_sort
Hint: Use the property that A XOR A == 0, and 0 XOR A == A.
As a variant of your solution (2), you can use radix sort. No extra memory, and it will run in linear time. You could argue that the time is also affected by the size of the number representation, but you have already given bounds for that: radix sort runs in time O(k n), where k is the number of digit positions (one pass per digit). That makes the whole algorithm O(7n) for sorting plus O(n) for checking the duplicated number, which is O(8n) = O(n).
Pros:
No extra memory
O(n)
Cons:
Need eight O(n) passes.
And how about the problem of finding ALL duplicates? Can this be done in less than O(n ln n) time? (Sort & scan.) (If you want to restore the original array, carry along the original index and reorder after the end, which can be done in O(n) time.)
from functools import reduce
def singleton(array):
    # XOR of the whole array, XORed with 1..1,000,000, cancels everything except the duplicated value.
    return reduce(lambda x, y: x ^ y, array) ^ reduce(lambda x, y: x ^ y, range(1, len(array)))
Sort the integers by putting each one in the place where it belongs (value v goes to index v-1, swapping as you go). If you get a "collision" (the target slot already holds that value), you have found the duplicated number.
Space complexity: O(1) (just the same space, overwritten in place).
Time complexity: O(n) in the worst case, and statistically you will usually hit the collision before reaching the end.
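Here is a small Python sketch of that placement idea (my own code; it assumes the values are 1..len(arr)-1 with exactly one repeat):
def find_duplicate_in_place(arr):
    i = 0
    while i < len(arr):
        v = arr[i]
        if v == i + 1:
            i += 1                                     # already in its slot
        elif arr[v - 1] == v:
            return v                                   # slot v-1 already holds v: collision
        else:
            arr[i], arr[v - 1] = arr[v - 1], arr[i]    # swap v into its slot
    return None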
