Find the one non-repeating element in an array?

I have an array of n elements in which exactly one element is not repeated; every other number occurs more than once. There is no limit on the range of the numbers in the array.
Some solutions are:
Using a hash table, which gives linear time complexity but very poor space complexity
Sorting the list using MergeSort in O(n log n) and then finding the element which doesn't repeat
Is there a better solution?

One general approach is to implement a bucketing technique (hashing is one such technique) to distribute the elements into different "buckets" using their identity (say, their value) and then find the bucket with the smallest size (1 in your case). This problem, I believe, is also known as the minority element problem. There will be as many buckets as there are unique elements in your set.
Doing this by hashing is problematic because of collisions and how your algorithm might handle that. Certain associative array approaches such as tries and extendable hashing don't seem to apply as they are better suited to strings.
One application of the above is to the Union-Find data structure. Your sets will be the buckets and you'll need to call MakeSet() and Find() for each element in your array for a cost of $O(\alpha(n))$ per call, where $\alpha(n)$ is the extremely slow-growing inverse Ackermann function. You can think of it as being effectively a constant.
You'll have to do Union when an element already exists. With some changes to keep track of the set with minimum cardinality, this solution should work. The time complexity of this solution is $O(n\alpha(n))$.
Your problem also appears to be loosely related to the Element Uniqueness problem.

Try multi-pass scanning if you have a strict space limitation.
Say the input has n elements and you can only hold m elements in memory. With a hash-table approach, in the worst case you need to handle n/2 unique numbers, so you want m > n/2. If m is not that big, you can partition the n elements by value into k = (max(input) - min(input)) / (2m) groups and scan the n input elements k times (in the worst case):
1st run: only hash-get/put/mark elements with value < min(input)+m*2; because in the range [min(input), min(input)+m*2) there are at most m unique elements and you can handle that. If you are lucky you already find the unique one; otherwise continue.
2nd run: only operate on elements with value in the range [min(input)+m*2, min(input)+m*4),
and so on.
In this way you trade the time complexity up to O(kn), but you get a space complexity bound of O(m).
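A minimal sketch of the multi-pass idea, assuming integer inputs. The function name and the window width are my own choices: the sketch uses windows of width m rather than the 2m described above, so that no pass can ever see more than m distinct integer values.

```python
from collections import Counter

def find_unique_multipass(values, m):
    """Return the value occurring exactly once, counting at most m distinct values per pass."""
    lo, hi = min(values), max(values)
    start = lo
    while start <= hi:
        end = start + m  # window [start, end) contains at most m distinct integers
        # One full scan of the input, but only values inside the window are counted.
        counts = Counter(v for v in values if start <= v < end)
        for value, count in counts.items():
            if count == 1:
                return value
        start = end
    return None  # no value occurs exactly once

print(find_unique_multipass([7, 3, 7, 9, 3, 12, 12, 12], 2))
```

The number of passes grows with (max - min) / m, so this only pays off when the value range is not too large compared with the available memory.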

Two ideas come to my mind:
A smoothsort may be a better alternative than the cited mergesort for your needs, given that it is O(1) in memory usage, O(n log n) in the worst case like merge sort, but O(n) in the best case;
Based on the (reverse) idea of the splay tree, you could make a type of tree that would push the leaves toward the bottom once they are used (instead of upward as in the splay tree). This would still give you an O(n log n) implementation of the sort, but the advantage would be the O(1) step of finding the unique element: it would be the root. Sorting and then scanning costs O(n log n) + O(n), whereas this algorithm would cost O(n log n) + O(1).
Otherwise, as you stated, using a hash-based solution (like a hash-implemented set) would result in an O(n) algorithm (O(n) to insert each element and keep a count for it, and O(n) to traverse the set to find the unique element), but you seemed to dislike the memory usage, though I don't know why. Memory is cheap, these days...
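For illustration, the hash-based counting approach can be sketched as follows. The function name is hypothetical; `collections.Counter` is a standard-library dict subclass that does the counting pass.

```python
from collections import Counter

def find_unique(values):
    """Return the value that occurs exactly once, or None if there is none."""
    counts = Counter(values)              # O(n) expected: one count per distinct value
    for value, count in counts.items():   # O(n) traversal of the table
        if count == 1:
            return value
    return None

print(find_unique([4, 1, 4, 2, 1, 4]))
```

The table holds one entry per distinct value, which is the memory cost the question is worried about.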


What's the most efficient way to construct a new, sorted, array?

Background
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one? Here's an example that will clear things up:
Example
I'm generating N random numbers and want to insert them into a new array as I generate them, and I want the final array to be sorted.
Possible Solutions
Insertion Sort
My gut told me that putting each element in the correct place as it's generated would be fastest. This is accomplished by doing a binary search to find the correct point in the array to insert the new element. However, this is an insertion sort, which is known to be less efficient on large lists than other sorting algorithms.
Quicksort
Quicksort is generally thought of as the most efficient 'general' sorting algorithm, where nothing is known about the inputs to the array, and it's more efficient than insertion sort on large lists. Would it, therefore, be best to simply put the random numbers in the array in an unsorted order, and then sort them at the end with quicksort?
Other Solutions
Is there another algorithm I haven't thought of?
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one? 
It boils down to the same problem for random data, due to efficiency considerations.
Given random data, it's actually more efficient to first generate the random values into an (unsorted) array, which is O(n) time, and then sort it with your favorite O(n log n) sort algorithm. The entire operation is then O(n) + O(n log n) = O(n log n) time and, depending on the sort algorithm used, between O(1) and O(n) space.
There is no way to beat that approach by "keeping an array sorted as it's constructed" for random data, because any approach will require O(n) generations/insertions of the values, and at least O(n log n) comparisons/swaps/shifts, no matter which method, from the numerous ones mentioned in comments on the original question, is used. Note, as per a very useful comment on my original answer, the binary insertion sort variant suggested in the original question will likely degrade to O(n^2) time complexity, because each insertion into an array may shift O(n) elements even though finding the position takes only O(log n). That makes it an inferior solution to just generating an array of values first and then sorting it.
Using a balanced tree just matches the time complexity of generating an array and then sorting it - but loses in space complexity, as trees have some overhead, compared to an array, to keep track of child nodes, etc. Also of note, trees are heap-allocated, and require a pointer dereference operation for accessing any child node - so even though the Big-O time complexity is equivalent to first generating an array of data and then sorting it, the real performance of the tree solution will be worse, as there's no data locality, and there's extra cost of pointer dereference. An additional consideration on balanced trees is that insertion cost into something like an AVL is quite high - that is, the n in AVL's O(n log n) insertion is not the same cost as n in an in-place sort of an array, due to necessary rotations of tree nodes to achieve balance. Just because Big-O is the same doesn't mean performance is the same. Even if you have an absolute need to be able to grab the data in a sorted order at some point during construction of the array, it might still be cheaper to just sort an array as you need it, unless you need it sorted at each insertion!
Note, this answer pertains to random data - it is possible, and even likely, to come up with a more efficient approach for "keeping an array sorted as it's constructed" if both the size and characteristics of the data are known, and follow some mathematical pattern, other than randomness; however, such approach would necessarily be overfit for the specific data set it relates to, rather than a general solution.
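To make the comparison concrete, here is a small sketch of the two strategies. The function names and the seeded generator are assumptions for illustration only. `bisect.insort` finds the position by binary search but still has to shift list elements, which is where the quadratic behaviour of the insertion approach comes from.

```python
import bisect
import random

def build_then_sort(n, seed=0):
    """Generate n random values, then sort once at the end: O(n log n)."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    data.sort()
    return data

def insert_sorted(n, seed=0):
    """Keep the list sorted after every insertion: O(n^2) element shifts in the worst case."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        bisect.insort(data, rng.random())  # binary search plus O(n) shift
    return data

# Both strategies produce the same sorted list for the same seed.
print(build_then_sort(1000) == insert_sorted(1000))
```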
I recommend the Heapsort or Mergesort.
Heapsort is a comparison-based algorithm that uses a binary heap data structure to sort elements. It divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element and moving that to the sorted region.
Mergesort is a comparison-based algorithm that focuses on how to merge together two pre-sorted arrays such that the resulting array is also sorted.
If you want a true O(n log n) and sorted "as it is constructed", I would recommend using a proper tree-based data structure instead of an array. You can use a self-balancing binary search tree, such as an AVL tree.

Check if there's a duplicate in an array using a hash table

Goal: Check if there's a duplicate number in an array with the size of n.
Basically, if we may use a hash table (open hashing, with linked lists), then we could iterate over the array and insert the numbers into the table with some value (could be 1, doesn't really matter).
While iterating, if the cell isn't empty then we have a duplicate number.
Since we know that the expected time for read/write is O(1) then the expected time for the algorithm is O(n).
Question #1: Why is the worst-case equal O(nlogn)?
Question #2: Would you do it differently than the solution suggested?
Here, I assume the author referred to a variant of hash table where each "bin" holds a BST (or some other deterministic data structure). In the worst case, all elements are repeatedly inserted into the same bin, and that requires O(n log n) overall.
However, hash tables are seldom implemented this way, because this worst case is very unlikely. The usual implementation keeps a plain linked list in each bin, and in that case the worst case for this solution is O(n^2).
The other way to approach this problem is to sort and then iterate to find duplicates (easy in sorted arrays). This is O(n log n) with significantly less memory usage.
This problem is known as the element distinctness problem, and these two options (with some variants maybe) are the ways to solve it.
Without hashing or extra memory, it is known to require Omega(n log n) time in comparison-based models.
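The two options can be sketched side by side. The function names are hypothetical, and Python's built-in `set` stands in for the hash table. Note that `sorted()` makes a copy; the lower memory usage mentioned above needs an in-place sort.

```python
def has_duplicate_hash(values):
    """Expected O(n): stop at the first value that is already in the table."""
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False

def has_duplicate_sort(values):
    """O(n log n): after sorting, any duplicates are adjacent."""
    ordered = sorted(values)
    return any(a == b for a, b in zip(ordered, ordered[1:]))

print(has_duplicate_hash([3, 1, 4, 1, 5]), has_duplicate_sort([3, 1, 4, 1, 5]))
print(has_duplicate_hash([3, 1, 4]), has_duplicate_sort([3, 1, 4]))
```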

Comparing two String array in most efficient way

This problem is about searching a string in a master array (contains the list of all UIDs). The second array contains all the strings to be searched.
For example:
First array(Master List) contains: UID1 UID2 UID3... UID99
Second array contains: UID3 UID144 UID50
If a match is found in the first array then 1 is returned, otherwise 0 is returned. So the output for the above example should be 101.
What could be the most efficient approach (targeting C) to solve the above, keeping in mind that the traditional way of dealing with this would be O(n^2)?
Sort the master string array and do a binary search.
Efficient in terms of what?
I would go with @Trying's suggestion as a good compromise between decent running speed, low memory usage, and very (very!) low complexity of implementation.
Just use qsort() to sort the first master array in place, then use bsearch() to search it.
Assuming n elements in the master array and m in the second array, this should give O(m*log n) time complexity which seems decent.
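As a rough illustration of the sort-then-binary-search idea, here is a sketch in Python rather than C, purely for brevity. The function name is hypothetical; `sorted` and `bisect.bisect_left` play the roles that `qsort()` and `bsearch()` play in C.

```python
import bisect

def match_flags_bsearch(master, queries):
    """Return a string with '1' per query found in master and '0' otherwise."""
    ordered = sorted(master)                   # sort the master list once
    flags = []
    for q in queries:
        i = bisect.bisect_left(ordered, q)     # O(log n) search per query
        found = i < len(ordered) and ordered[i] == q
        flags.append("1" if found else "0")
    return "".join(flags)

master = [f"UID{i}" for i in range(1, 100)]    # UID1 .. UID99
print(match_flags_bsearch(master, ["UID3", "UID144", "UID50"]))
```

The sort uses plain lexicographic string order, which is all a binary search needs as long as the same order is used for the lookups.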
Another option is to build a hash table for the strings in the master list. That is a one-time O(M) cost (assuming the string lengths are O(1)). Then, assuming the hash distributes evenly, searching for a single element should take O(M/S) on average, with S being the size of the hash table (even distribution means that, on average, this is the number of elements mapping into the same hash entry). You can further control the size to fine-tune the trade-off between space and efficiency.
There are mainly two good approaches for this problem:
Use a binary search: a binary search requires the UIDs in the first array to be sorted and allows you to find a solution in O(log n) where n is the number of elements in the master array. The total complexity would be O(m log n) with m the number of elements to be searched.
Use a hashmap: You can store the elements of the master array in a hashmap (O(n)) and then check whether your elements of the second array are in the hashmap (O(m)). The total complexity would be O(n+m).
While the complexity of the second approach looks better, you must keep in mind that if your hash is bad, it could be O(m*n) in the worst case (though that is very unlikely). It also uses more memory, and the individual operations are slower. In your case, I would use the first approach.
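For comparison, the hashmap approach can be sketched as follows, again in Python for brevity. The function name is hypothetical and the built-in `set` stands in for the hash table.

```python
def match_flags_hash(master, queries):
    """Return a string with '1' per query found in master and '0' otherwise."""
    known = set(master)   # O(n) expected time to build
    # O(1) expected time per membership test, so O(m) for all queries
    return "".join("1" if q in known else "0" for q in queries)

master = [f"UID{i}" for i in range(1, 100)]   # UID1 .. UID99
print(match_flags_hash(master, ["UID3", "UID144", "UID50"]))
```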

Sort an array which is partially sorted

I am trying to sort an array which has the following property:
it increases up to some point, then it starts decreasing, then increases, then decreases, and so on. Is there any algorithm which can sort this in less than n log(n) complexity by making use of it being partially ordered?
array example = 14,19,34,56,36,22,20,7,45,56,50,32,31,45......... upto n
Thanks in advance
Any sequence of numbers will go up and down and up and down again, etc., unless it is already fully sorted (it may start with a down, of course). You could run through the sequence noting the points where it changes direction, then merge-sort the subsequences (reading the decreasing ones in reverse).
In general the complexity is N log N because we don't know how sorted it is at this point. If it is moderately well sorted, i.e. there are fewer changes of direction, it will take fewer comparisons.
You could find the change / partition points, and perform a merge sort between pairs of partitions. This would take advantage of the existing ordering, as normally the merge sort starts with pairs of elements.
Edit: Just trying to figure out the complexity here. Merge sort is n log(n), where the log(n) relates to the number of merge rounds: first every pair of elements, then every pair of pairs, etc., until you reach the size of the array. In this case you have n elements in p partitions, where p < n, so there are only about log(p) rounds. Each round still touches all n elements, so I'm guessing the complexity is n log(p), but am open to correction. E.g. merge each pair of partitions, and repeat with half the number of partitions after each merge.
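A minimal sketch of this run-based merge, under my own naming: find the maximal increasing and decreasing runs, reverse the decreasing ones, then merge neighbouring runs pairwise until one remains. `heapq.merge` from the standard library does the two-way merges.

```python
from heapq import merge

def split_runs(values):
    """Split values into maximal monotone runs, each returned in ascending order."""
    runs = []
    i, n = 0, len(values)
    while i < n:
        j = i + 1
        if j < n and values[j] < values[i]:
            # Strictly decreasing run: extend it, then store it reversed.
            while j < n and values[j] < values[j - 1]:
                j += 1
            runs.append(values[i:j][::-1])
        else:
            # Non-decreasing run (also covers a single trailing element).
            while j < n and values[j] >= values[j - 1]:
                j += 1
            runs.append(values[i:j])
        i = j
    return runs

def run_merge_sort(values):
    """Sort by merging pre-existing runs pairwise: about log(p) rounds for p runs."""
    runs = split_runs(values)
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs) - 1, 2):
            merged.append(list(merge(runs[k], runs[k + 1])))
        if len(runs) % 2:          # odd run out is carried to the next round
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []

data = [14, 19, 34, 56, 36, 22, 20, 7, 45, 56, 50, 32, 31, 45]
print(len(split_runs(data)), run_merge_sort(data))
```

On the example array from the question, the split finds 5 runs, so three merge rounds are enough.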
See Topological sorting
If you know for a fact that the data are "almost sorted" and the set size is reasonably small (say an array that can be indexed by a 16-bit integer), then Shell sort is probably your best bet. Yes, it has a basic time complexity of O(n^2) (which can be reduced by the sequence used for gap sizing to a current best worst case of O(n*log^2(n))), but the performance improves with the sortedness of the input set to a best case of O(n) on an already-sorted set. Using Sedgewick's sequence for gap size will give the best performance on those occasions when the input is not as sorted as you expected it to be.
Strand Sort might be close to what you're looking for. O(n sqrt(n)) in the average case, O(n) best case (list already sorted), O(n^2) worst case (list sorted in reverse order).
Share and enjoy.

How can we find the i'th greatest element of the array?

Algorithm for finding the nth smallest/largest element in an array using a self-balancing binary search tree data structure.
I have read the post "Find kth smallest element in a binary search tree in Optimum way", but the correct answer is not clear to me, as I am not able to work it out for an example that I took. A bit more explanation would be appreciated.
C.A.R. Hoare's select algorithm is designed for precisely this purpose. It executes in [expected] linear time, with logarithmic extra storage.
Edit: the obvious alternative of sorting, then picking the right element has O(N log N) complexity instead of O(N). Storing the i largest elements in sorted order requires O(i) auxiliary storage, and roughly O(N * i log i) complexity. This can be a win if i is known a priori to be quite small (e.g. 1 or 2). For more general use, select is usually better.
Edit2: offhand, I don't have a good reference for it, but described the idea in a previous answer.
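A rough sketch of the selection idea, under stated assumptions: this is a list-based quickselect with a random pivot, written for readability. It is not Hoare's in-place partitioning, so it uses O(N) extra storage rather than the logarithmic storage mentioned above. The function name is hypothetical.

```python
import random

def ith_greatest(values, i):
    """Return the i-th greatest element (i = 1 is the maximum); expected O(N) time."""
    if not 1 <= i <= len(values):
        raise ValueError("i must be between 1 and len(values)")
    items = list(values)
    k = i  # rank, counted from the greatest, within the current candidates
    while True:
        pivot = random.choice(items)
        greater = [v for v in items if v > pivot]
        equal_count = sum(1 for v in items if v == pivot)
        if k <= len(greater):
            items = greater                     # answer is among the larger values
        elif k <= len(greater) + equal_count:
            return pivot                        # pivot itself has rank k
        else:
            k -= len(greater) + equal_count     # skip everything >= pivot
            items = [v for v in items if v < pivot]

print(ith_greatest([5, 1, 9, 3, 7], 2))
```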
First sort the array descending, then take the ith element.
Create a sorted data structure to hold i elements and set the initial count to 0.
Process each element in the source array, adding it to that new structure until the new structure is full.
Then process the rest of the source array. For each one that is larger than the smallest in the sorted data structure, remove the smallest from that structure and put the new one in.
Once you've processed all elements in the source array, your structure will hold the i greatest elements. Just grab the smallest of these and you have your i'th greatest element.
Voila!
Alternatively, sort it then just grab the i'th element directly.
That's a fitting task for heaps, which feature very low insert and low delete_min costs, e.g. pairing heaps. The worst-case performance would be O(n*log(n)). But since they are non-trivial to implement, first check the selection algorithms suggested in the other answers.
There are many strategies available for your task (if you don't focus on the self-balancing tree to begin with).
It's usually a tradeoff speed / memory. Most algorithms require either to modify the array in place or O(N) additional storage.
The solution with a self-balancing tree is in the latter category, but it's not the right choice here. The issue is that building the tree itself takes O(N*log N), which will dominate the later search term and give a final complexity of O(N*log N). Therefore you're no better off than simply sorting the array, and you are using a more complex data structure...
In general, the issue largely depends on the magnitude of i relative to N. If you think for a minute, for i == 1 it's trivial, right? It's called finding the maximum.
Well, the same strategy obviously works for i == 2 (carrying the 2 maximum elements around) in linear time. And it's also trivially symmetric: i.e. if you need to find the (N-1)th element, then just carry around the 2 minimum elements.
However, it loses efficiency when i is about N/2 or N/4. Carrying the i maximum elements then means sorting an array of size i... and thus we fall back on the N*log N wall.
Jerry Coffin pointed out a simple solution, which works well for this case. Here is the reference on Wikipedia. The full article also describes the Median of Median method: it's more reliable, but involves more work and is thus generally slower.
Create an empty list L
For each element x in the original list,
    add x in sorted position to L
    if L has more than i elements,
        pop the smallest one off L
if L has i elements,
    return the smallest element of L (the i-th greatest overall),
else
    return failure
This should take O(N log i), provided L is kept in a structure with O(log i) insert and delete-min, such as a heap. If i is assumed to be a constant, then it is O(N).
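The steps above can be sketched with a bounded min-heap, assuming the standard-library `heapq` module as the structure holding the i largest values seen so far. The function name is hypothetical.

```python
import heapq

def ith_greatest_heap(values, i):
    """Return the i-th greatest element, or None if there are fewer than i elements."""
    if i < 1:
        return None
    heap = []  # min-heap holding at most the i largest values seen so far
    for x in values:
        if len(heap) < i:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest kept value: replace it in O(log i).
            heapq.heapreplace(heap, x)
    # The heap root is the smallest of the i largest, i.e. the i-th greatest.
    return heap[0] if len(heap) == i else None

print(ith_greatest_heap([5, 1, 9, 3, 7], 2))
```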
Build a max-heap from the elements and extract the maximum i times (or, for the i'th smallest, build a min-heap and call MIN i times).
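A short sketch of this heap idea, assuming numeric values and 1 <= i <= N. Python's `heapq` only provides a min-heap, so the values are negated to simulate a max-heap. Building the heap is O(N) and each extraction is O(log N), for O(N + i log N) overall. The function name is hypothetical.

```python
import heapq

def ith_greatest_by_popping(values, i):
    """Return the i-th greatest element by heapifying everything and popping i times."""
    heap = [-v for v in values]   # negate so the min-heap behaves like a max-heap
    heapq.heapify(heap)           # O(N)
    result = None
    for _ in range(i):            # i extractions, O(log N) each
        result = -heapq.heappop(heap)
    return result

print(ith_greatest_by_popping([5, 1, 9, 3, 7], 2))
```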
