Testing whether or not an array is distinct in O(N) time and O(1) extra space - is it possible? - arrays

So I found this purported interview question(1), that looks something like this
Given an array of length n of integers with unknown range, find in O(n) time and O(1) extra space whether or not it contains any duplicate terms.
There are no additional conditions and restrictions given. Assume that you can modify the original array. If it helps, you can restrict the datatype of the integers to ints (the original wording was a bit ambiguous) - although try not to use a variable with 2^(2^32) bits to represent a hash map.
I know there is a solution for a similar problem, where the maximum integer in the array is restricted to n-1. I am aware that problems like
Count frequencies of all elements in array in O(1) extra space and O(n) time
Find the maximum repeating number in O(n) time and O(1) extra space
Algorithm to determine if array contains n…n+m?
exist and either have solutions, or answers saying that it is impossible. However, for 1. and 2. the problems are stronger than this one, and for 3. I'm fairly sure the solution offered there would require the additional n-1 constraint to be adapted for the task here.
So is there any solution to this, or is this problem unsolvable? If so, is there a proof that it is not solvable in O(n) time and O(1) extra space?
(1) I say purported - I can't confirm whether or not it is an actual interview question, so I can't confirm that anyone thought it was solvable in the first place.

We can sort integer arrays in O(N) time! Therefore, sort and run the well-known algorithm for adjacent distinct.
bool distinct(int array[], size_t n)
{
if (n > 0xFFFFFFFF)
return true; // Pigeonhole
else if (n > 0x7FFFFFFF)
radix_sort(array, n); // Yup O(N) sort
else
heapsort(array, n); // N is small enough that heapsort's O(N log (N)) is smaller than radix_sort's O(32N) after constant adjust
for (size_t i = 1; i < n; i++)
if (array[i] == array[i - 1])
return true;
return false;
}

You can do this in expected linear time by using the original array like a hash table...
Iterate through the array, and for each item, let item, index be the item and its index, and let hash(item) be a value in [0,n). Then:
If hash(item) == index, then just leave the item there and move on. Otherwise,
If item == array[hash(item)] then you've found a duplicate and you're all done. Otherwise,
If item < array[hash(item)] or hash(array[hash(item)]) != hash(item), then swap those and repeat with the new item at array[index]. Otherwise,
Leave the item and move on.
Now you can discard all the array elements where hash(item) == index. These are guaranteed to be the smallest items that hash to their target indexes, and they are guaranteed not to be duplicates.
Move all the remaining items to the front of the array and repeat with the new, smaller, subarray.
Each step takes O(N) time, and on average will remove some significant proportion of the remaining elements, leading to O(N) time overall. We can speed things up by taking advantage all the free slots we're creating in the array, but that doesn't improve the overall complexity.

Related

Can we find mode of an array without hashmap in un sorted array in O(n) time

Can we find the mode of an array in O(n) time without using Additional O(n) space, nor Hash. Moreover the data is not sorted?
The problem is not easier then Element distinctness problem1 - so basically without the additional space - the problem's complexity is Theta(nlogn) at best (and since it can be done in Theta(nlogn) - it is ineed the case).
So basically - if you cannot use extra space for the hash table, best is sort and iterate, which is Theta(nlogn).
(1) Given an algorithm A that runs in O(f(n)) for this problem, it is easy to see that one can run A and then verify that the resulting element repeats more then once with an extra iteration to solve the element distinctness problem in O(f(n) + n).
Under the right circumstances, yes. Just for example, if your data is amenable to a radix sort, then you can sort with only constant extra space in linear time, followed by a linear scan through the sorted data to find the mode.
If your data requires comparison-based sorting, then I'm pretty sure O(N log N) is about as well as you can do in the general case.
Just count the frequencies. This is not O(n) space, it is O(k), with k being the number of distinct values in the range. This is actually constant space.
Time is clearly linear O(n)
//init
counts = array[k]
for i = 0 to k
counts[i] = 0
maxCnt = 0
maxVal = vals[0]
for val in vals
counts[val]++
if (counts[val] > maxCnt)
maxCnt = counts[val]
maxVal = val
The main problem here, is that while k may be a constant, it may also be very very huge. However, k could also be small. Regardless, this does properly answer your question, even if it isn't practical.

Inserting unknown number of elements into dynamic array in linear time

(This question is inspired by deque::insert() at index?, I was surprised that it wasn't covered in my algorithm lecture and that I also didn't find it mentioned in another question here and even not in Wikipedia :). I think it might be of general interest and I will answer it myself ...)
Dynamic arrays are datastructures that allow addition of elements at the end in amortized constant time O(1) (by doubling the size of the allocated memory each time it needs to grow, see Amortized time of dynamic array for a short analysis).
However, insertion of a single element in the middle of the array takes linear time O(n), since in the worst case (i.e. insertion at first position) all other elements needs to be shifted by one.
If I want to insert k elements at a specific index in the array, the naive approach of performit the insert operation k times would thus lead to a complexity of O(n*k) and, if k=O(n), to a quadratic complexity of O(n²).
If I know k in advance, the solution is quite easy: Expand the array if neccessary (possibly reallocating space), shift the elements starting at the insertion point by k and simply copy the new elements.
But there might be situations, where I do not know the number of elements I want to insert in advance: For example I might get the elements from a stream-like interface, so I only get a flag when the last element is read.
Is there a way to insert multiple (k) elements, where k is not known in advance, into a dynamic array at consecutive positions in linear time?
In fact there is a way and it is quite simple:
First append all k elements at the end of the array. Since appending one element takes O(1) time, this will be done in O(k) time.
Second rotate the elements into place. If you want to insert the elements at position index. For this you need to rotate the subarray A[pos..n-1] by k positions to the right (or n-pos-k positions to the left, which is equivalent). Rotation can be done in linear time by use of a reverse operation as explained in Algorithm to rotate an array in linear time. Thus the time needed for rotation is O(n).
Therefore the total time for the algorithm is O(k)+O(n)=O(n+k). If the number of elements to be inserted is in the order of n (k=O(n)), you'll get O(n+n)=O(2n)=O(n) and thus linear time.
You could simply allocate a new array of length k+n and insert the desired elements linearly.
newArr = new T[k + n];
for (int i = 0; i < k + n; i++)
newArr[i] = i <= insertionIndex ? oldArr[i]
: i <= insertionIndex + k ? toInsert[i - insertionIndex - 1]
: oldArr[i - k];
return newArr;
Each iteration takes constant time, and it runs k+n times, thus O(k+n) (or, O(n) if you so like).

Find the minimum number of elements required so that their sum equals or exceeds S

I know this can be done by sorting the array and taking the larger numbers until the required condition is met. That would take at least nlog(n) sorting time.
Is there any improvement over nlog(n).
We can assume all numbers are positive.
Here is an algorithm that is O(n + size(smallest subset) * log(n)). If the smallest subset is much smaller than the array, this will be O(n).
Read http://en.wikipedia.org/wiki/Heap_%28data_structure%29 if my description of the algorithm is unclear (it is light on details, but the details are all there).
Turn the array into a heap arranged such that the biggest element is available in time O(n).
Repeatedly extract the biggest element from the heap until their sum is large enough. This takes O(size(smallest subset) * log(n)).
This is almost certainly the answer they were hoping for, though not getting it shouldn't be a deal breaker.
Edit: Here is another variant that is often faster, but can be slower.
Walk through elements, until the sum of the first few exceeds S. Store current_sum.
Copy those elements into an array.
Heapify that array such that the minimum is easy to find, remember the minimum.
For each remaining element in the main array:
if min(in our heap) < element:
insert element into heap
increase current_sum by element
while S + min(in our heap) < current_sum:
current_sum -= min(in our heap)
remove min from heap
If we get to reject most of the array without manipulating our heap, this can be up to twice as fast as the previous solution. But it is also possible to be slower, such as when the last element in the array happens to be bigger than S.
Assuming the numbers are integers, you can improve upon the usual n lg(n) complexity of sorting because in this case we have the extra information that the values are between 0 and S (for our purposes, integers larger than S are the same as S).
Because the range of values is finite, you can use a non-comparative sorting algorithm such as Pigeonhole Sort or Radix Sort to go below n lg(n).
Note that these methods are dependent on some function of S, so if S gets large enough (and n stays small enough) you may be better off reverting to a comparative sort.
Here is an O(n) expected time solution to the problem. It's somewhat like Moron's idea but we don't throw out the work that our selection algorithm did in each step, and we start trying from an item potentially in the middle rather than using the repeated doubling approach.
Alternatively, It's really just quickselect with a little additional book keeping for the remaining sum.
First, it's clear that if you had the elements in sorted order, you could just pick the largest items first until you exceed the desired sum. Our solution is going to be like that, except we'll try as hard as we can to not to discover ordering information, because sorting is slow.
You want to be able to determine if a given value is the cut off. If we include that value and everything greater than it, we meet or exceed S, but when we remove it, then we are below S, then we are golden.
Here is the psuedo code, I didn't test it for edge cases, but this gets the idea across.
def Solve(arr, s):
# We could get rid of worse case O(n^2) behavior that basically never happens
# by selecting the median here deterministically, but in practice, the constant
# factor on the algorithm will be much worse.
p = random_element(arr)
left_arr, right_arr = partition(arr, p)
# assume p is in neither left_arr nor right_arr
right_sum = sum(right_arr)
if right_sum + p >= s:
if right_sum < s:
# solved it, p forms the cut off
return len(right_arr) + 1
# took too much, at least we eliminated left_arr and p
return Solve(right_arr, s)
else:
# didn't take enough yet, include all elements from and eliminate right_arr and p
return len(right_arr) + 1 + Solve(left_arr, s - right_sum - p)
One improvement (asymptotically) over Theta(nlogn) you can do is to get an O(n log K) time algorithm, where K is the required minimum number of elements.
Thus if K is constant, or say log n, this is better (asymptotically) than sorting. Of course if K is n^epsilon, then this is not better than Theta(n logn).
The way to do this is to use selection algorithms, which can tell you the ith largest element in O(n) time.
Now do a binary search for K, starting with i=1 (the largest) and doubling i etc at each turn.
You find the ith largest, and find the sum of the i largest elements and check if it is greater than S or not.
This way, you would run O(log K) runs of the selection algorithm (which is O(n)) for a total running time of O(n log K).
eliminate numbers < S, if you find some number ==S, then solved
pigeon-hole sort the numbers < S
Sum elements highest to lowest in the sorted order till you exceed S.

Compare two integer arrays with same length

[Description] Given two integer arrays with the same length. Design an algorithm which can judge whether they're the same. The definition of "same" is that, if these two arrays were in sorted order, the elements in corresponding position should be the same.
[Example]
<1 2 3 4> = <3 1 2 4>
<1 2 3 4> != <3 4 1 1>
[Limitation] The algorithm should require constant extra space, and O(n) running time.
(Probably too complex for an interview question.)
(You can use O(N) time to check the min, max, sum, sumsq, etc. are equal first.)
Use no-extra-space radix sort to sort the two arrays in-place. O(N) time complexity, O(1) space.
Then compare them using the usual algorithm. O(N) time complexity, O(1) space.
(Provided (max − min) of the arrays is of O(Nk) with a finite k.)
You can try a probabilistic approach - convert the arrays into a number in some huge base B and mod by some prime P, for example sum B^a_i for all i mod some big-ish P. If they both come out to the same number, try again for as many primes as you want. If it's false at any attempts, then they are not correct. If they pass enough challenges, then they are equal, with high probability.
There's a trivial proof for B > N, P > biggest number. So there must be a challenge that cannot be met. This is actually the deterministic approach, though the complexity analysis might be more difficult, depending on how people view the complexity in terms of the size of the input (as opposed to just the number of elements).
I claim that: Unless the range of input is specified, then it is IMPOSSIBLE to solve in onstant extra space, and O(n) running time.
I will be happy to be proven wrong, so that I can learn something new.
Insert all elements from the first array into a hashtable
Try to insert all elements from the second array into the same hashtable - for each insert to element should already be there
Ok, this is not with constant extra space, but the best I could come up at the moment:-). Are there any other constraints imposed on the question, like for example to biggest integer that may be included in the array?
A few answers are basically correct, even though they don't look like it. The hash table approach (for one example) has an upper limit based on the range of the type involved rather than the number of elements in the arrays. At least by by most definitions, that makes the (upper limit on) the space a constant, although the constant may be quite large.
In theory, you could change that from an upper limit to a true constant amount of space. Just for example, if you were working in C or C++, and it was an array of char, you could use something like:
size_t counts[UCHAR_MAX];
Since UCHAR_MAX is a constant, the amount of space used by the array is also a constant.
Edit: I'd note for the record that a bound on the ranges/sizes of items involved is implicit in nearly all descriptions of algorithmic complexity. Just for example, we all "know" that Quicksort is an O(N log N) algorithm. That's only true, however, if we assume that comparing and swapping the items being sorted takes constant time, which can only be true if we bound the range. If the range of items involved is large enough that we can no longer treat a comparison or a swap as taking constant time, then its complexity would become something like O(N log N log R), were R is the range, so log R approximates the number of bits necessary to represent an item.
Is this a trick question? If the authors assumed integers to be within a given range (2^32 etc.) then "extra constant space" might simply be an array of size 2^32 in which you count the occurrences in both lists.
If the integers are unranged, it cannot be done.
You could add each element into a hashmap<Integer, Integer>, with the following rules: Array A is the adder, array B is the remover. When inserting from Array A, if the key does not exist, insert it with a value of 1. If the key exists, increment the value (keep a count). When removing, if the key exists and is greater than 1, reduce it by 1. If the key exists and is 1, remove the element.
Run through array A followed by array B using the rules above. If at any time during the removal phase array B does not find an element, you can immediately return false. If after both the adder and remover are finished the hashmap is empty, the arrays are equivalent.
Edit: The size of the hashtable will be equal to the number of distinct values in the array does this fit the definition of constant space?
I imagine the solution will require some sort of transformation that is both associative and commutative and guarantees a unique result for a unique set of inputs. However I'm not sure if that even exists.
public static boolean match(int[] array1, int[] array2) {
int x, y = 0;
for(x = 0; x < array1.length; x++) {
y = x;
while(array1[x] != array2[y]) {
if (y + 1 == array1.length)
return false;
y++;
}
int swap = array2[x];
array2[x] = array2[y];
array2[y] = swap;
}
return true;
}
For each array, Use Counting sort technique to build the count of number of elements less than or equal to a particular element . Then compare the two built auxillary arrays at every index, if they r equal arrays r equal else they r not . COunting sort requires O(n) and array comparison at every index is again O(n) so totally its O(n) and the space required is equal to the size of two arrays . Here is a link to counting sort http://en.wikipedia.org/wiki/Counting_sort.
given int are in the range -n..+n a simple way to check for equity may be the following (pseudo code):
// a & b are the array
accumulator = 0
arraysize = size(a)
for(i=0 ; i < arraysize; ++i) {
accumulator = accumulator + a[i] - b[i]
if abs(accumulator) > ((arraysize - i) * n) { return FALSE }
}
return (accumulator == 0)
accumulator must be able to store integer with range = +- arraysize * n
How 'bout this - XOR all the numbers in both the arrays. If the result is 0, you got a match.

Find duplicate entry in array of integers

As a homework question, the following task had been given:
You are given an array with integers between 1 and 1,000,000. One
integer is in the array twice. How can you determine which one? Can
you think of a way to do it using little extra memory.
My solutions so far:
Solution 1
List item
Have a hash table
Iterate through array and store its elements in hash table
As soon as you find an element which is already in hash table, it is
the dup element
Pros
It runs in O(n) time and with only 1 pass
Cons
It uses O(n) extra memory
Solution 2
Sort the array using merge sort (O(nlogn) time)
Parse again and if you see a element twice you got the dup.
Pros
It doesn't use extra memory
Cons
Running time is greater than O(n)
Can you guys think of any better solution?
The question is a little ambiguous; when the request is "which one," does it mean return the value that is duplicated, or the position in the sequence of the duplicated one? If the former, any of the following three solutions will work; if it is the latter, the first is the only that will help.
Solution #1: assumes array is immutable
Build a bitmap; set the nth bit as you iterate through the array. If the bit is already set, you've found a duplicate. It runs on linear time, and will work for any size array.
The bitmap would be created with as many bits as there are possible values in the array. As you iterate through the array, you check the nth bit in the array. If it is set, you've found your duplicate. If it isn't, then set it. (Logic for doing this can be seen in the pseudo-code in this Wikipedia entry on Bit arrays or use the System.Collections.BitArray class.)
Solution #2: assumes array is mutable
Sort the array, and then do a linear search until the current value equals the previous value. Uses the least memory of all. Bonus points for altering the sort algorithm to detect the duplicate during a comparison operation and terminating early.
Solution #3: (assumes array length = 1,000,001)
Sum all of the integers in the array.
From that, subtract the sum of the integers 1 through 1,000,000 inclusive.
What's left will be your duplicated value.
This take almost no extra memory, can be done in one pass if you calculate the sums at the same time.
The disadvantage is that you need to do the entire loop to find the answer.
The advantages are simplicity, and a high probability it will in fact run faster than the other solutions.
Assuming all the numbers from 1 to 1,000,000 are in the array, the sum of all numbers from 1 to 1,000,000 is (1,000,000)*(1,000,000 + 1)/2 = 500,000 * 1,000,001 = 500,000,500,000.
So just add up all the numbers in the array, subtract 500,000,500,000, and you'll be left with the number that occured twice.
O(n) time, and O(1) memory.
If the assumption isn't true, you could try using a Bloom Filter - they can be stored much more compactly than a hash table (since they only store fact of presence), but they do run the risk of false positives. This risk can be bounded though, by our choice of how much memory to spend on the bloom filter.
We can then use the bloom filter to detect potential duplicates in O(n) time and check each candidate in O(n) time.
This python code is a modification of QuickSort:
def findDuplicate(arr):
orig_len = len(arr)
if orig_len <= 1:
return None
pivot = arr.pop(0)
greater = [i for i in arr if i > pivot]
lesser = [i for i in arr if i < pivot]
if len(greater) + len(lesser) != orig_len - 1:
return pivot
else:
return findDuplicate(lesser) or findDuplicate(greater)
It finds a duplicate in O(n logn)), I think. It uses extra memory on the stack, but it can be rewritten to use only one copy of the original data, I believe:
def findDuplicate(arr):
orig_len = len(arr)
if orig_len <= 1:
return None
pivot = arr.pop(0)
greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
if len(arr):
return pivot
else:
return findDuplicate(lesser) or findDuplicate(greater)
The list comprehensions that produce greater and lesser destroy the original with calls to pop(). If arr is not empty after removing greater and lesser from it, then there must be a duplicate and it must be pivot.
The code suffers from the usual stack overflow problems on sorted data, so either a random pivot or an iterative solution which queues the data is necessary:
def findDuplicate(full):
import copy
q = [full]
while len(q):
arr = copy.copy(q.pop(0))
orig_len = len(arr)
if orig_len > 1:
pivot = arr.pop(0)
greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
if len(arr):
return pivot
else:
q.append(greater)
q.append(lesser)
return None
However, now the code needs to take a deep copy of the data at the top of the loop, changing the memory requirements.
So much for computer science. The naive algorithm clobbers my code in python, probably because of python's sorting algorithm:
def findDuplicate(arr):
arr = sorted(arr)
prev = arr.pop(0)
for element in arr:
if element == prev:
return prev
else:
prev = element
return None
Rather than sorting the array and then checking, I would suggest writing an implementation of a comparison sort function that exits as soon as the dup is found, leading to no extra memory requirement (depending on the algorithm you choose, obviously) and a worst case O(nlogn) time (again, depending on the algorithm), rather than a best (and average, depending...) case O(nlogn) time.
E.g. An implementation of in-place merge sort.
http://en.wikipedia.org/wiki/Merge_sort
Hint: Use the property that A XOR A == 0, and 0 XOR A == A.
As a variant of your solution (2), you can use radix sort. No extra memory, and will run in
linear time. You can argue that time is also affected by the size of numbers representation, but you have already given bounds for that: radix sort runs in time O(k n), where k is the number of digits you can sort ar each pass. That makes the whole algorithm O(7n)for sorting plus O(n) for checking the duplicated number -- which is O(8n)=O(n).
Pros:
No extra memory
O(n)
Cons:
Need eight O(n) passes.
And how about the problem of finding ALL duplicates? Can this be done in less than
O(n ln n) time? (Sort & scan) (If you want to restore the original array, carry along the original index and reorder after the end, which can be done in O(n) time)
def singleton(array):
return reduce(lambda x,y:x^y, array)
Sort integer by sorting them on place they should be. If you get "collision" than you found the correct number.
space complexity O(1) (just same space that can be overwriten)
time complexity less than O(n) becuse you will statisticaly found collison before getting on the end.

Resources