Is finding a pair of equal integers in an array O(n)? - arrays

Given an array of integers, what is the worst-case time complexity of finding a pair of integers that are equal?
I think this can be done in O(n) by using counting sort or by using XOR.
Am I right?
The question is not concerned with space complexity, and the given answer says O(n log n).

Counting sort
If the input allows you to use counting sort, then all you have to do is sort the input array in O(n) time and then look for duplicates, also in O(n) time. This algorithm can be improved (although not in complexity), since you don't actually need to sort the array. You can create the same auxiliary array that counting sort uses, which is indexed by the input integers, and then add these integers one by one until the current one has already been inserted. At this point, the two equal integers have been found.
This solution provides worst-case, average and best-case linear time complexities (O(n)), but requires the input integers to be in a known and ideally small range.
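As an illustration, here is a minimal sketch of that early-exit variant, assuming the inputs are non-negative integers smaller than some known bound k (the function name is just for this example):
def find_equal_pair(arr, k):
    # seen[v] is True once the value v has been encountered
    seen = [False] * k
    for v in arr:
        if seen[v]:
            return v      # v occurs at least twice
        seen[v] = True
    return None           # all elements are distinct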
Hashing
If you cannot use counting sort, then you could fall back on hashing and use the same solution as before (without sorting), with a hash table instead of the auxiliary array. The issue with hash tables is that the worst-case time complexity of their operations is linear, not constant. Indeed, due to collisions and rehashing, insertions are done in O(n) time in the worst case.
Since you need O(n) insertions, that makes the worst-case time complexity of this solution quadratic (O(n²)), even though its average and best-case time complexities are linear (O(n)).
Sorting
Another solution, in case counting sort is not applicable, is to use another sorting algorithm. The worst-case time complexity for comparison-based sorting algorithms is, at best, O(n log n). The solution would be to sort the input array and look for duplicates in O(n) time.
This solution has worst-case and average time complexities of O(n log n), and depending on the sorting algorithm, a best-case linear time complexity (O(n)).
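A minimal sketch of this sort-and-scan approach, using Python's built-in sorted (a comparison sort, O(n log n) in the worst case):
def find_equal_pair_by_sorting(arr):
    s = sorted(arr)               # O(n log n) comparison sort
    for i in range(len(s) - 1):
        if s[i] == s[i + 1]:      # equal values are adjacent after sorting
            return s[i]
    return None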

Following is the pseudocode for counting sort, written out here as runnable Python:
def counting_sort(input, k, key=lambda x: x):
    # input -- the array of items to be sorted; key(x) returns the key for item x
    # n -- the length of the input
    # k -- a number such that all keys are in the range 0..k-1
    # count -- an array of numbers, with indexes 0..k-1, initially all zero
    # output -- an array of items, with indexes 0..n-1
    # x -- an individual input item, used within the algorithm
    # total, oldCount, i -- numbers used within the algorithm
    n = len(input)
    count = [0] * k
    output = [None] * n
    # calculate the histogram of key frequencies:
    for x in input:
        count[key(x)] += 1
    # calculate the starting index for each key:
    total = 0
    for i in range(k):  # i = 0, 1, ... k-1
        oldCount = count[i]
        count[i] = total
        total += oldCount
    # copy to output array, preserving order of inputs with equal keys:
    for x in input:
        output[count[key(x)]] = x
        count[key(x)] += 1
    return output
As you can observe, all the keys are in the range 0 ... k-1. In your case the number itself is the key, and it has to be in a known range for counting sort to be applicable. Only then can this be done in O(n) time with O(k) space.
Otherwise, the solution is O(n log n) using any comparison-based sort.

If you subscribe to integer sorts being O(n), then by all means this is O(n) by sorting + iterating until two adjacent elements compare equal.
Hashing is actually O(n²) in the worst case (you have the world's worst hashing algorithm that hashes everything to the same index). Although in practice, using a hash table to get counts will give you linear time performance (average case).
In reality, linear time integer sorts "cheat" by fixing the number of bits used to represent an integer as some constant k that can then be ignored later. (In practice, though, these are great assumptions and integer sorts can be really fast!)
Comparison-based sorts like merge sort will give you O(n log n) complexity in the worst case.
The XOR solution you speak of is for finding a single unique "extra" item between two otherwise identical lists of integers.
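For context, here is a minimal sketch of that XOR trick (names are illustrative only): XOR-ing every element of both lists cancels the values that appear in both, leaving the single extra element.
def find_extra(short_list, long_list):
    # every value present in both lists cancels itself out under XOR
    x = 0
    for v in short_list:
        x ^= v
    for v in long_list:
        x ^= v
    return x  # the single "extra" element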

Related

Understanding the Big O for squaring elements in an array

I was working on a problem on LeetCode where you have to square the numbers in a sorted array. Here is the original problem:
Given an array of integers A sorted in non-decreasing order, return an array of the squares of each number, also in sorted non-decreasing order.
I am trying to understand the big O for my code and for the code that was given in the solution.
This is my code
def sortedSquare(A):
    new_A = []
    for num in A:
        num = num * num
        new_A.append(num)
    return sorted(new_A)

print(sortedSquare([-4, -1, 0, 3, 10]))
Here is the code from the solution:
def sortedSquares(self, A):
    return sorted(x*x for x in A)
For the solution, the Big O is O(N log N), where N is the length of the array. I don't understand why it would be O(N log N) and not just O(N)?
For my solution, I am seeing it as Big O of N because I am just iterating through the entire array.
Also, is my solution a good solution compared to the solution that was given?
Your solution does the exact same thing as the given solution. Both solutions square all the elements and then sort the resultant array, with the leetcode solution being a bit more concise.
The reason why both these solutions are O(NlogN) is because of the use of sorted(). Python's builtin sort is timsort which sorts the array in O(NlogN) time. The use of sorted(), not squaring, is what provides the dominant factor in your time complexity (O(NlogN) + O(N) = O(NlogN)).
Note though that this problem can be solved in O(N) using two pointers or by using the merge step in mergesort.
Edit:
David Eisenstat brought up a very good point on timsort. Timsort aggregates strictly increasing and strictly decreasing runs and merges them. Since the resultant squared array will be first strictly decreasing and then strictly increasing, timsort will actually reverse the strictly decreasing run and then merge them in O(N).
The way complexity works is that the overall complexity for the whole program is the worst complexity for any one part. So, in your case, you have the part that squares the numbers and you have the part that sorts the numbers. So which part is the one that determines the overall complexity?
The squaring part is O(n) because you only touch the elements once in order to square them.
What about the sorting part? Generally it depends on what sorting function you use:
Most sort routines have O(n*log(n)) because they use a divide and conquer algorithm.
Some (like bubble sort) have O(n^2)
Some (like the counting sort) have O(n)
In your case, they say that the given solution is O(n*log(n)), and since the squaring part is O(n), the sorting part must be O(n*log(n)). And since your code uses the same sorting function as the given solution, your sort must also be O(n*log(n)).
So your squaring part is O(n), your sorting part is O(n*log(n)), and the overall complexity is the worst of those: O(n*log(n)).
If extra storage space is allowed (like in your solution), the whole process can be performed in time O(N). The initial array is already sorted. You can split it in two subsequences with the negative and positive values.
Square all elements (O(N)) and reverse the negative subsequence (O(N) at worst), so that both sequences are sorted. If one of the subsequences is empty, you are done.
Otherwise, merge the two sequences, in time O(N) (this is the step that uses extra O(N) space).
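Here is a minimal sketch of that O(N) approach, assuming the input A is sorted in non-decreasing order (the function name is illustrative):
def sorted_squares_linear(A):
    # split point: first index with a non-negative value
    split = next((i for i, v in enumerate(A) if v >= 0), len(A))
    neg = [v * v for v in reversed(A[:split])]  # now sorted ascending
    pos = [v * v for v in A[split:]]            # already sorted ascending
    # standard O(N) merge of two sorted lists
    out, i, j = [], 0, 0
    while i < len(neg) and j < len(pos):
        if neg[i] <= pos[j]:
            out.append(neg[i]); i += 1
        else:
            out.append(pos[j]); j += 1
    out.extend(neg[i:])
    out.extend(pos[j:])
    return out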

Sorting a partially sorted array in O(n)

Hey so I'm just really stuck on this question.
I need to devise an algorithm (no need for code) that sorts a certain partially sorted array into a fully sorted array. The array has N real numbers, and the first N - [N/sqrt(N)] elements (the [] denotes the floor of this number) are sorted, while the rest are not. There are no special properties to the unsorted numbers at the end; in fact, I'm told nothing about them other than that they're real numbers like the rest.
The kicker is that the time complexity of the algorithm needs to be O(n).
My first thought was to try and sort only the unsorted numbers and then use a merge algorithm, but I can't figure out any sorting algorithm that would work here in O(n). So I'm thinking about this all wrong, any ideas?
This is not possible in the general case using a comparison-based sorting algorithm. You are most likely missing something from the question.
Imagine the partially sorted array [1, 2, 3, 4564, 8481, 448788, 145, 86411, 23477]. It contains 9 elements, the first 3 of which are sorted (note that floor(N/sqrt(N)) = floor(sqrt(N)) assuming you meant N/sqrt(N), and floor(sqrt(9)) = 3). The problem is that the unsorted elements are all in a range that does not contain the sorted elements. It makes the sorted part of the array useless to any sorting algorithm, since they will stay there anyway (or be moved to the very end in the case where they are greater than the unsorted elements).
With this kind of input, you still need to sort, independently, N - floor(sqrt(N)) elements. And as far as I know, N - floor(sqrt(N)) ~ N (the ~ basically means "is the same complexity as"). So you are left with an array of approximately N elements to sort, which takes O(N log N) time in the general case.
Now, I specified "using a comparison-based sorting algorithm", because sorting real numbers (in some range, like the usual floating-point numbers stored in computers) can be done in amortized O(N) time using a hash sort (similar to a counting sort), or maybe even a modified radix sort if done properly. But the fact that a part of the array is already sorted doesn't help.
In other words, this means there are sqrt(N) unsorted elements at the end of the array. You can sort them with an O(n^2) algorithm which will give a time of O(sqrt(N)^2) = O(N); then do the merge you mentioned which will also run in O(N). Both steps together will therefore take just O(N).
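A minimal sketch of that idea, assuming roughly the last sqrt(N) elements are the unsorted ones (math.isqrt is used here to approximate the boundary described in the question):
import math

def sort_partially_sorted(a):
    # the last ~sqrt(N) elements are unsorted; everything before them is sorted
    n = len(a)
    m = n - math.isqrt(n)
    tail = a[m:]
    # insertion sort on ~sqrt(N) elements: O(sqrt(N)^2) = O(N)
    for i in range(1, len(tail)):
        x, j = tail[i], i - 1
        while j >= 0 and tail[j] > x:
            tail[j + 1] = tail[j]
            j -= 1
        tail[j + 1] = x
    # merge the two sorted runs: O(N)
    prefix = a[:m]
    out, i, j = [], 0, 0
    while i < len(prefix) and j < len(tail):
        if prefix[i] <= tail[j]:
            out.append(prefix[i]); i += 1
        else:
            out.append(tail[j]); j += 1
    out.extend(prefix[i:])
    out.extend(tail[j:])
    return out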

Array with specific values

Given an array of size n where:
1/2 of the array holds a single (unknown) value.
1/4 of the array holds a single (unknown) different value.
And so on for 1/8, 1/16, 1/32, ...
Give an algorithm to sort the array.
You cannot use the find median algorithm
So what I figured is:
There are only log n different values.
There is a simple solution using a binary heap in O(n log log n).
It looks like a question that needs to be solved in O(n).
Here is one possible approach:
scan the array and store element frequencies (there are log n distinct elements) in a hash table in amortized O(n) time; this is doable because we can do insertions in amortized O(1) time;
now run a classic sorting algorithm on these log n elements: this is doable in deterministic O(log n log log n) time using, say, heap sort or merge sort;
now expand the sorted array---or create a new one and fill it using the sorted array and the hash table---using frequencies from the hash table; this is doable in O(n) amortized time.
The whole algorithm thus runs in amortized O(n) time, i.e., it is dominated by eliminating duplicates and expanding the sorted array. The space complexity is O(n).
This is essentially optimal because you need to "touch" all the elements to print the sorted array, which means we have a matching lower bound of Omega(n) on the running time.
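A minimal Python sketch of those three steps, with collections.Counter playing the role of the hash table (names are illustrative):
from collections import Counter

def sort_special_array(a):
    # 1. count frequencies of the O(log n) distinct values: amortized O(n)
    freq = Counter(a)
    # 2. sort the O(log n) distinct values: O(log n * log log n)
    distinct = sorted(freq)
    # 3. expand back into a full sorted array: O(n)
    out = []
    for v in distinct:
        out.extend([v] * freq[v])
    return out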
The idea is to use the Majority algorithm, which takes O(n), to discover the "half" value, delete it from the array, and then repeat on the remaining array.
n + n/2 + n/4 + n/8 + ... < 2n => O(n)
Going over the array once, keep a hash map of the seen values.
Like you said, there are only log(n) different values.
Now you have a list of all the different values - sorting them will take log(n)*log(log(n)).
Once you have the sorted unique list, it's easy to construct the original array: each value fills a number of cells equal to its count (n/2 for the most frequent, n/4 for the next, and so on).
The total run time is O(n + log(n)*log(log(n)) + n), which is O(n).

Efficient way to compute sum of k largest numbers in a list?

I was reading some practice interview questions and I have a question about this one. Assume a list of random integers, each between 1 and 100; compute the sum of the k largest integers. Discuss space and time complexity, and whether the approach changes if each integer is between 1 and m, where m varies.
My first thought is to sort the array and compute the sum of the largest k numbers. Then I thought about using a binary tree structure where I could start looking from the bottom right of the tree. I am not sure whether my approach would change if the numbers are 1 to 100 or 1 to m. Any thoughts on the most efficient approach?
The most efficient way might be to use something like randomized quickselect. It doesn't do the sorting step to completion and instead does just the partition step from quicksort. If you don't need the k largest integers in any particular order, this is the approach I'd go with. It takes linear time, but the analysis is not very straightforward. m would have little impact on this. Also, you can write the code in such a way that the sum is computed as you partition the array.
Time: O(n)
Space: O(1)
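A rough sketch of that partition-based idea (random pivot, descending Lomuto partition, summing once the k largest occupy the front of the array); the function name is illustrative and this is a sketch, not a production implementation:
import random

def sum_k_largest(nums, k):
    # Partition so the k largest values end up in a[:k] (average O(n)),
    # then sum them. Assumes 1 <= k <= len(nums).
    a = list(nums)
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        # Lomuto partition of a[lo..hi] around a random pivot, descending order
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot, store = a[hi], lo
        for i in range(lo, hi):
            if a[i] > pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]
        if store == k - 1:       # pivot is exactly the k-th largest
            break
        elif store < k - 1:      # the k-th largest lies to the right
            lo = store + 1
        else:                    # the k-th largest lies to the left
            hi = store - 1
    return sum(a[:k])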
The alternative is sorting using something like counting sort which has a linear time guarantee. As you say the values are integers in a fixed range, it would work quite well. As m increases the space requirement goes up, but computing the sum is quite efficient within the buckets.
Time: O(m) in the worst case (see comments for the argument)
Space: O(m)
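A sketch of the counting approach for values in a fixed range 1..m (m = 100 in the question; names are illustrative):
def sum_k_largest_counting(nums, k, m=100):
    # count[v] = number of occurrences of value v (values assumed in 1..m)
    count = [0] * (m + 1)
    for v in nums:
        count[v] += 1
    # walk the buckets from the largest value down, taking up to k elements
    total, remaining = 0, k
    for v in range(m, 0, -1):
        take = min(count[v], remaining)
        total += take * v
        remaining -= take
        if remaining == 0:
            break
    return total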
I'd say sorting is probably unnecessary. If k is small, then all you need to do is maintain a sorted list that truncates elements beyond the k-th largest element.
Each step in this should be O(k) in the worst possible case, where the element added is maximal. However, the average case is much better: after a certain number of elements, most will simply be smaller than the last element in the list and the operation will be O(log(k)).
One way is to use a min-heap (implemented as a binary tree) of maximum size k. To see if a new element belongs in the heap or not is only O(1) since it's a min-heap and retrieval of minimum element is a constant time operation. Each insertion step (or non-insertion...in the case of an element that is too small to be inserted) along the O(n) list is O(log k). The final tree traversal and summation step is O(k).
Total complexity:
O(n log k + k) = O(n log k)
Unless you have multiple cores running on your computer (in which case parallel computing is an option), summation should only be done at the end. On-the-fly computing adds additional computation steps without actually reducing your time complexity at all (you will actually have more computations to do). You will always have to sum k elements anyway, so why not avoid the additional addition and subtraction steps?
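A minimal sketch of the bounded min-heap idea using Python's heapq module (the helper name is illustrative):
import heapq

def sum_k_largest_heap(nums, k):
    heap = []                              # min-heap holding the k largest seen so far
    for v in nums:
        if len(heap) < k:
            heapq.heappush(heap, v)        # O(log k)
        elif v > heap[0]:                  # heap[0] is the smallest of the current k
            heapq.heapreplace(heap, v)     # pop the minimum and push v, O(log k)
    return sum(heap)                       # O(k)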

fastest way to find if all the elements of an array are distinct?

I am looking for a faster way to find whether an array contains only distinct elements. The worst thing to do is to take each element and compare it to every other element in the array. Next best would be to sort the list and then compare adjacent elements, which still does not improve things much. Is there any other way to do this?
Brute-force:
Brute-force (checking every element with every other element) takes O(n²).
Sorting:
Sorting takes O(n log n), which is generally considered to be a fairly decent running time.
Sorting has the advantage over the hash table approach below in that it can be done in-place (O(1) extra space), whereas the latter takes O(n) extra space.
Hash table:
An alternative is to use a hash table.
For each item:
Check if that item exists in the hash table (if it does, all items are not distinct) and
Insert that item into the hash table
Since insert and contains queries run in expected O(1) on a hash table, the overall running time would be expected O(n), and, as mentioned above, O(n) extra space.
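A minimal sketch of the hash table approach, using a Python set:
def all_distinct(arr):
    seen = set()
    for x in arr:
        if x in seen:     # duplicate found: not all distinct
            return False
        seen.add(x)
    return True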
Bit array:
Another alternative, if the elements are all integers in some given range, is to have a bit array with size equal to the range of integers.
Similarly to what was done for the hash table approach, for each element, you'd check whether the applicable bit is set, and then set it.
This takes O(m + n) time and O(m) extra space where m is the range of integers and n is the size of the array (unless you consider allocating the array to be free, in which case it just takes O(n) time).
Create a red-black tree with the elements as keys and their number of occurrences as values. You can then navigate the tree. The time complexity is O(n log n) and the space complexity is O(n), where n is the number of elements. Key benefits of using a red-black tree include consistent performance and simple memory management - an excellent choice for a distributed environment. Perspectives welcome.
Alternative solution (interesting only from a theoretical point of view):
I think you can adapt the Quickselect algorithm. In short, this algorithm runs in the same way as Quicksort, but it only splits the array into two groups according to some chosen pivot (less than and greater than the pivot, respectively), and the recursive sorting is omitted. Its average-case performance is O(n).
My idea is to look for elements equal to the chosen pivot on each step. In this way, whenever there are more than two elements, we will compare the pivot to each element. If we have found a duplicate, we have the answer. Otherwise we split the problem in two similar ones, but with smaller size and run the algorithm on them.
Disclaimer: The worst case performance of Quickselect is O(n^2). Therefore, using a hash table is way more time efficient.
However, as Quickselect is an in-place algorithm, it requires only constant memory overhead as opposed to linear additional memory for a hash table (not that it matters nowadays).
Here is an O(1) space complexity approach. The idea is that we keep the unique elements at the beginning of the array itself.
The time complexity is O(n*log(n)), since we want to avoid extra space usage and can therefore use Python's in-place sort method for lists.
It may feel like C, but it worked for me:
a.sort()                          # in-place comparison sort: O(n log n)
i = 0                             # read index
k = 0                             # last index of the unique prefix
while i < len(a) - 1:
    if a[i] == a[i + 1]:
        # skip over the whole run of equal values
        j = i
        while j < len(a) - 1 and a[j] == a[j + 1]:
            j += 1
        if j < len(a) - 1:
            a[k + 1] = a[j + 1]   # pull the next distinct value forward
            i = j + 1
            k += 1
        else:
            break                 # the run of duplicates reaches the end of the array
    else:
        a[k + 1] = a[i + 1]       # next value is already distinct: keep it
        i += 1
        k += 1
a = a[:k + 1]                     # the unique elements, in sorted order
