I was given the following question in a technical interview:
How do I remove duplicates from an unsorted array?
One option I was thinking of:
Create a hash map with the frequency of each number in the array
Go through the array and do an O(1) lookup in the hash map. If the frequency is greater than 1, remove the number from the array.
Is there a more efficient way?
Another option
Sort the array in O(n log n) using quicksort or merge sort
Then iterate through the array and remove duplicates
Why is option 1 better than option 2?
I cannot use any functions that already do the work like array_unique.
Instead of removing the object from the array when the hash map says there is a duplicate, why not build a new array and only add each item to it if it isn't a duplicate? The idea is to save the extra step of having two arrays with equal overhead at the start. PHP sucks at garbage collection, so if you start with a massive array, even though you unset its values, it might still be hanging around in memory.
For the first option, the time complexity is O(n): creating the hash map is O(n) and iterating through the array is O(n), so O(n) in total.
For the second option, the time complexity is O(n log n): the sort is O(n log n) and the iteration is O(n), so O(n log n) in total.
Clearly the first option is better. Hope this helps :)
If you have no constraints on creating another data structure to track state but must mutate the array in-place and only remove duplicates without sorting, then a variant of your first option may be best.
I propose you build a hashmap as you iterate the array, using the array values as keys and any garbage (a boolean set to TRUE, perhaps) as the value. As you encounter each item in the array (which is O(n)), check the map. If the key exists, delete the item from the array; if not, add the key-value pair. There is no need to track a count; you only need to track what has been encountered.
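To make that concrete, here is a minimal C sketch of the idea (the fixed-size probing table, its size, and the helper names are my own simplifications; a real implementation would grow the table as needed). Rather than deleting from the middle of the array, it compacts the survivors to the front, which is the cheap way to do the same single pass in C.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1024                          /* assumed capacity; must exceed the element count */

/* naive open-addressing "seen" set with linear probing; sketch only, no resizing */
typedef struct { int keys[TABLE_SIZE]; bool used[TABLE_SIZE]; } seen_set;

static bool seen_insert(seen_set *s, int key)    /* returns false if the key was already present */
{
    size_t i = (size_t)(unsigned)key % TABLE_SIZE;
    while (s->used[i]) {
        if (s->keys[i] == key) return false;     /* duplicate */
        i = (i + 1) % TABLE_SIZE;                /* linear probing */
    }
    s->used[i] = true;
    s->keys[i] = key;
    return true;
}

/* keep the first occurrence of each value, compact in place, return the new length */
size_t dedup_in_place(int *a, size_t n)
{
    seen_set s;
    memset(&s, 0, sizeof s);
    size_t out = 0;
    for (size_t i = 0; i < n; i++)
        if (seen_insert(&s, a[i]))
            a[out++] = a[i];
    return out;
}

int main(void)
{
    int a[] = {3, 1, 3, 2, 1, 5};
    size_t n = dedup_in_place(a, sizeof a / sizeof a[0]);
    for (size_t i = 0; i < n; i++) printf("%d ", a[i]);   /* prints: 3 1 2 5 */
    putchar('\n');
}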
Many languages have a built-in set abstract data type which basically performs this operation on construction or via an add-all operation. If you are allowed to return a separate data structure with duplicates removed, just create a new set from the array's items and let that data structure remove the duplicates.
Related
Some elements with integer keys are in an array. I want the elements with equal keys to be in groups inside the array. This can be accomplished by sorting the elements, however, it does not matter to me whether the elements are sorted, only that they are in groups of equal keys. Is there a way to accomplish this that is faster than sorting?
A hash map should work well on average. Use a "count" for the value, which gets incremented each time you see the corresponding key in the array, and then use those counts to overwrite your array.
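A minimal C sketch of that counting idea, assuming the keys are small non-negative integers (the key range, sample data, and function names are my own):

#include <stdio.h>

#define KEY_RANGE 256                    /* assumption: keys fall in 0..255 */

/* rearrange a[] so equal keys sit next to each other, groups in first-seen order */
void group_equal_keys(int *a, size_t n)
{
    size_t count[KEY_RANGE] = {0};
    int order[KEY_RANGE];                /* distinct keys in first-seen order */
    size_t distinct = 0;

    for (size_t i = 0; i < n; i++) {     /* first pass: count every key */
        if (count[a[i]] == 0)
            order[distinct++] = a[i];
        count[a[i]]++;
    }

    size_t out = 0;                      /* second pass: overwrite the array group by group */
    for (size_t d = 0; d < distinct; d++)
        for (size_t c = 0; c < count[order[d]]; c++)
            a[out++] = order[d];
}

int main(void)
{
    int a[] = {7, 2, 7, 5, 2, 7};
    group_equal_keys(a, 6);
    for (size_t i = 0; i < 6; i++) printf("%d ", a[i]);   /* prints: 7 7 7 2 2 5 */
    putchar('\n');
}

This is O(n + k) for key range k, so it can beat sorting when n is large, at the cost of the extra count table.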
That said, calling "sort" is still pretty fast and easier to read. A good quicksort can actually avoid some work when duplicates exist, so you should really run some benchmarks to be sure that an uglier approach is fast enough to be worthwhile.
I am new to the Swift language and have seen lots of tutorials, but it's not clear: my question is, what's the main difference between the Array, Set and Dictionary collection types?
Here are the practical differences between the different types:
Arrays are effectively ordered lists and are used to store lists of information in cases where order is important.
For example, posts in a social network app being displayed in a tableView may be stored in an array.
Sets are different in that order does not matter; they are used in cases where ordering is irrelevant.
Sets are especially useful when you need to ensure that an item only appears once in the set.
Dictionaries are used to store key-value pairs and are useful when you want to easily find a value using a key, just like in a real dictionary.
For example, you could store a list of items and links to more information about these items in a dictionary.
Hope this helps :)
(For more information and to find Apple's own definitions, check out Apple's guides at https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/CollectionTypes.html)
Detailed documentation can be found in Apple's guide (linked above). Below are some quick definitions extracted from there:
Array
An array stores values of the same type in an ordered list. The same value can appear in an array multiple times at different positions.
Set
A set stores distinct values of the same type in a collection with no defined ordering. You can use a set instead of an array when the order of items is not important, or when you need to ensure that an item only appears once.
Dictionary
A dictionary stores associations between keys of the same type and values of the same type in a collection with no defined ordering. Each value is associated with a unique key, which acts as an identifier for that value within the dictionary. Unlike items in an array, items in a dictionary do not have a specified order. You use a dictionary when you need to look up values based on their identifier, in much the same way that a real-world dictionary is used to look up the definition for a particular word.
Old thread, yet it is worth talking about performance.
With N elements inside an array or a dictionary, it is worth considering the performance when you access elements or add or remove objects.
Arrays
Accessing a random element costs the same as accessing the first or last: elements follow each other sequentially, so they are addressed directly. That costs 1 cycle.
Inserting an element is costly. If you insert at the beginning or in the middle, the remainder has to be shifted, which can cost as much as N cycles in the worst case (on average N/2). If you append to the end and there is enough room in the array, it costs 1 cycle; otherwise the whole array is copied, which costs N cycles. This is why it is important to reserve enough space for the array at the start.
Deleting from the end costs 1 cycle; deleting from the beginning or the middle requires shifting, on average N/2 cycles.
Finding an element with a given property costs N/2 cycles on average.
So be very cautious with huge arrays.
Dictionaries
While Dictionaries are unordered, they bring some benefits here. Since keys are hashed and stored in a hash table, insertion, lookup, and removal each cost 1 cycle on average. The only exception is finding an element by some other property, which can cost around N/2 cycles. With clever design, however, you can use those property values as dictionary keys, so the lookup costs only 1 cycle no matter how many elements are inside.
Swift Collections - Array, Dictionary, Set
Every collection is dynamic, which is why there are extra steps for expanding and collapsing: an Array has to allocate more memory and copy the old data into the new storage, and a Dictionary additionally has to recalculate the bucket index for every object inside.
Big O notation describes the performance of a function.
Array - ArrayList - a dynamic array of objects, built on top of a plain array. It is used for tasks where you frequently need access by index.
get by index - O(1)
find element - O(n) - worst case, the element you are looking for is the last one
insert/delete - O(n) - the tail of the array has to be shifted every time
Dictionary - HashTable, HashMap - stores key/value pairs. It contains buckets/baskets (an array structure, accessed by index), each of which holds another structure (an array list, linked list, or tree). Collisions are resolved by separate chaining. The main idea is:
calculate the key's hash code (Hashable), and from this hash code compute the index of the bucket (for example using modulo).
Since the hash function returns an Int, it cannot guarantee that two different objects have different hash codes, and the number of buckets is far smaller than Int.max. When two different objects have the same hash code, or two objects with different hash codes land in the same bucket, that is a collision. That is why, once we know the bucket index, we must check whether anything already there equals our key, and Equatable comes to the rescue. If two keys are equal, the key/value entry is replaced; otherwise a new key/value entry is added.
find element - O(1) to O(n)
insert/delete - O(1) to O(n)
O(n) - the case where every object has the same hash code, so everything lands in a single bucket. That is why the hash function should distribute elements evenly.
As you can see, a HashMap doesn't support access by index, but for the other operations it has better performance.
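For illustration, here is a toy separate-chaining hash map sketched in C (the bucket count, hash function, and names are my own simplifications, not how Swift's Dictionary is actually implemented):

#include <stdio.h>
#include <stdlib.h>

#define NUM_BUCKETS 16                   /* fixed bucket count for the sketch */

typedef struct node { int key; int value; struct node *next; } node;
typedef struct { node *buckets[NUM_BUCKETS]; } hashmap;

static size_t bucket_index(int key) { return (size_t)(unsigned)key % NUM_BUCKETS; }

/* insert or replace: walk the chain; if an equal key exists, overwrite it, otherwise prepend */
void map_put(hashmap *m, int key, int value)
{
    size_t b = bucket_index(key);
    for (node *n = m->buckets[b]; n; n = n->next)
        if (n->key == key) { n->value = value; return; }
    node *n = malloc(sizeof *n);
    n->key = key;
    n->value = value;
    n->next = m->buckets[b];
    m->buckets[b] = n;
}

/* returns 1 and fills *out if the key is found, 0 otherwise */
int map_get(const hashmap *m, int key, int *out)
{
    for (const node *n = m->buckets[bucket_index(key)]; n; n = n->next)
        if (n->key == key) { *out = n->value; return 1; }
    return 0;
}

int main(void)
{
    hashmap m = {0};
    map_put(&m, 1, 10);
    map_put(&m, 17, 20);                 /* 17 % 16 == 1: collides with key 1, chained in the same bucket */
    int v;
    if (map_get(&m, 17, &v)) printf("%d\n", v);   /* prints: 20 */
}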
Set - HashSet - based on a HashTable with no values.
*You can also implement something like Java's TreeMap/TreeSet, which is a sorted structure but with O(log n) complexity to access an element.
I want to store a small number of items (fewer than 255) which have constant size (a C char) and be able to do the following operations:
Insert a value at an arbitrary position and have the other items preserve their previous order.
Delete an item and have the other items preserve their order (as above).
Find the next and previous of an item.
I have tried using an array and writing a function that adds a value by moving all items after it one place forward. The same can be done for deleting, but it is too inefficient. Of course, I do not mind using a library, as long as it is readily available and free.
Array - access: O(1), insert: O(n)
Double-linked list - access O(n), previous/next: O(1), insert(*): O(1)
RB tree with the number of children stored: O(log n) for all operations.
(*): You need to traverse the list first to get to the position (O(n)).
Note: no, the array is not messy, it's really simple to implement. Also as you can see, depending on the usage, it can be quite efficient.
Based on the number of elements and your remark about the array implementation, you should stick to arrays.
You could use a doubly-linked list for this. However, that won't work if you want to keep array behaviour, e.g. accessing elements by index quickly (O(1) for an array versus O(n) for a linked list).
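Given the small, fixed-size items, a doubly-linked list in C is only a few lines; here is a sketch (node layout and helper names are mine, and error handling is omitted):

#include <stdio.h>
#include <stdlib.h>

/* doubly-linked list of chars; prev/next directly give you the previous and next item */
typedef struct node { char value; struct node *prev, *next; } node;

/* insert a new value right after pos (or at the head if pos is NULL) - O(1), order of the rest is preserved */
node *insert_after(node **head, node *pos, char value)
{
    node *n = malloc(sizeof *n);
    n->value = value;
    if (pos == NULL) {                   /* insert at the head */
        n->prev = NULL;
        n->next = *head;
        if (*head) (*head)->prev = n;
        *head = n;
    } else {
        n->prev = pos;
        n->next = pos->next;
        if (pos->next) pos->next->prev = n;
        pos->next = n;
    }
    return n;
}

/* unlink and free a node - O(1), the neighbours keep their order */
void remove_node(node **head, node *n)
{
    if (n->prev) n->prev->next = n->next; else *head = n->next;
    if (n->next) n->next->prev = n->prev;
    free(n);
}

int main(void)
{
    node *head = NULL;
    node *a = insert_after(&head, NULL, 'a');
    node *b = insert_after(&head, a, 'b');
    insert_after(&head, b, 'c');
    remove_node(&head, b);               /* 'a' and 'c' are now neighbours */
    for (node *n = head; n; n = n->next) printf("%c ", n->value);   /* prints: a c */
    putchar('\n');
}

The trade-off mentioned above still applies: finding a position by index is O(n), but once you hold a pointer to a node, insert, delete, previous and next are all O(1).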
Requirements/constraint:
delete only duplicates
keep one copy
list is not initially sorted
How can this be implemented in C?
(An algorithm and/or code would be greatly appreciated!)
If the list is very long, you want reasonable performance, and you are OK with allocating an extra O(log n) of memory, you can sort it in O(n log n) using qsort or merge sort:
http://swiss-knife.blogspot.com/2010/11/sorting.html
Then you can remove duplicates in O(n) (the total is O(n log n) + O(n)).
If your list is very tiny, you can do as jswolf19 suggests, and you will get n(n-1)/2 comparisons in the worst case.
There are several different ways of detecting/deleting duplicates:
Nested loops
Take the next value in sequence, then scan until the end of the list to see if this value occurs again. This is O(n^2) -- although I believe the bounds can be argued lower? -- but the actual performance may be better, as only scanning from i to the end (not 0 to the end) is done and it may terminate early. This does not require extra data aside from a few variables.
(See Christoph's answer as how this could be done just using a traversal of the linked list and destructive "appending" to a new list -- e.g. the nested loops don't have to "feel" like nested loops.)
Sort and filter
Sort the list (mergesort can be modified to work on linked lists) and then detect duplicate values (they will be side-by-side now). With a good sort this is O(n*lg(n)). The sorting phase usually is/can be destructive (e.g. you have "one copy") but it has been modified ;-)
Scan and maintain a look-up
Scan the list and as the list is scanned add the values to a lookup. If the lookup already contains said values then there is a duplicate! This approach can be O(n) if the lookup access is O(1). Generally a "hash/dictionary" or "set" is used as the lookup, but if only a limited range of integrals are used then an array will work just fine (e.g. the index is the value). This requires extra storage but no "extra copy" -- at least in the literal reading.
For small values of n, big-O is pretty much worthless ;-)
Happy coding.
I'd either
mergesort the list followed by a linear scan to remove duplicates
use an insertion-sort based algorithm which already removes duplicates when re-building the list
The former will be faster; the latter is easier to implement from scratch: just construct a new list by popping elements off your old list and inserting them into the new one by scanning until you hit an element of greater value (in which case you insert the element there) or equal value (in which case you discard the element).
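Here is a minimal C sketch of that second approach, assuming a singly-linked list of ints (the node type and names are my own):

#include <stdio.h>
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node;

/* rebuild the list sorted and without duplicates; nodes are moved (or freed) from the old list */
node *sort_unique(node *old)
{
    node *sorted = NULL;
    while (old) {
        node *n = old;
        old = old->next;                       /* pop the next element off the old list */

        node **insert_at = &sorted;            /* scan the new list for the insertion point */
        while (*insert_at && (*insert_at)->value < n->value)
            insert_at = &(*insert_at)->next;

        if (*insert_at && (*insert_at)->value == n->value) {
            free(n);                           /* equal value: discard the duplicate */
        } else {
            n->next = *insert_at;              /* greater value (or end of list): insert here */
            *insert_at = n;
        }
    }
    return sorted;
}

int main(void)
{
    int vals[] = {3, 1, 3, 2};
    node *list = NULL;
    for (int i = 3; i >= 0; i--) {             /* build the list 3 -> 1 -> 3 -> 2 */
        node *n = malloc(sizeof *n);
        n->value = vals[i];
        n->next = list;
        list = n;
    }
    list = sort_unique(list);
    for (node *n = list; n; n = n->next) printf("%d ", n->value);   /* prints: 1 2 3 */
    putchar('\n');
}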
Well, you can sort the list first and then check for duplicates, or you could do something like the following:
/* naive O(n^2) dedup over an array: delete the later copy by overwriting it with the last element */
int dedup(int list[], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; )
            if (list[i] == list[j])
                list[j] = list[--n];   /* delete one of them */
            else
                j++;
    return n;                          /* new length */
}
This is probably the most unoptimized piece of crap, but it'll probably work.
Iterate through the list, keeping a pointer to the previous object each time you move on to the next one. Inside your iteration loop, iterate through the rest of the list to check for a duplicate. If there is a duplicate, back in the main iteration loop, get the next object, set the previous object's next pointer to the object you just retrieved, then break out of the loop and restart the whole process until there are no duplicates.
You can do this in linear time using a hash table.
You'd want to scan through the list sequentially. Each time you encounter an odd numbered element, look it up in your hash table. If that number is already in the hash table, delete it from the list, if not add it to the hash table and continue.
Basically the idea is that for each element you scan in the list, you are able to check in constant time whether it is a duplicate of a previous element that you've seen. This takes only a single pass through your list and will take at worst a linear amount of memory (worst case is that every element of the list is a unique odd number, thus your hash table is as long as your list).
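Setting aside the odd-number detail from that particular question, here is a minimal C sketch of the single pass over a singly-linked list. The flat "seen" table stands in for the hash table and assumes non-negative values below a small bound; a general implementation would use a real hash set:

#include <stdbool.h>
#include <stdlib.h>

#define VALUE_RANGE 1024                 /* assumption: values fall in 0..1023 */

typedef struct node { int value; struct node *next; } node;

/* single pass: unlink and free any node whose value has been seen before */
void delete_duplicates(node **head)
{
    bool seen[VALUE_RANGE] = {false};
    node **link = head;
    while (*link) {
        node *n = *link;
        if (seen[n->value]) {            /* duplicate: bypass it and free it */
            *link = n->next;
            free(n);
        } else {                         /* first occurrence: remember it and move on */
            seen[n->value] = true;
            link = &n->next;
        }
    }
}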
I am thinking of sorting and then doing binary search. Is that the best way?
I advocate hashes in such cases: you'll get time proportional to the combined size of both arrays.
Since most major languages offer a hash table in their standard libraries, I hardly need to show you how to implement such a solution.
Iterate through each one and use a hash table to store counts. The key is the value of the integer and the value is the count of appearances.
It depends. If one set is substantially smaller than the other, or for some other reason you expect the intersection to be quite sparse, then a binary search may be justified. Otherwise, it's probably easiest to step through both at once. If the current element in one is smaller than in the other, advance to the next item in that array. When/if you get to equal elements, you send that as output, and advance to the next item in both arrays. (This assumes, that as you advocated, you've already sorted both, of course).
This is an O(N+M) operation, where N is the size of one array and M the size of the other. Using a binary search, you get O(N log2 M) instead, which can be lower complexity if one array is a lot smaller than the other, but is likely to be a net loss if they're close to the same size.
Depending on what you need/want, the versions that attempt to just count occurrences can cause a pretty substantial problem: if there are multiple occurrences of a single item in one array, they will still count that as two occurrences of that item, indicating an intersection that doesn't really exist. You can prevent this, but doing so renders the job somewhat less trivial -- you insert items from one array into your hash table, but always set the count to 1. When that's finished, you process the second array by setting the count to 2 if and only if the item is already present in the table.
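A C sketch of that refinement (the fixed-size probing table and names are my own; resizing and collision-heavy cases are not handled): items from the first array go in with count 1, and the second array only promotes existing entries to 2, so each common value is reported exactly once.

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1024                  /* sketch: fixed size, must exceed the number of distinct values */

typedef struct { int keys[TABLE_SIZE]; int counts[TABLE_SIZE]; } count_map;

/* find the slot for key using linear probing (count 0 marks an empty slot) */
static size_t slot_for(count_map *m, int key)
{
    size_t i = (size_t)(unsigned)key % TABLE_SIZE;
    while (m->counts[i] != 0 && m->keys[i] != key)
        i = (i + 1) % TABLE_SIZE;
    return i;
}

/* print each value that appears in both arrays, once */
void print_intersection(const int *a, size_t na, const int *b, size_t nb)
{
    count_map m;
    memset(&m, 0, sizeof m);

    for (size_t i = 0; i < na; i++) {    /* first array: mark presence, count stays 1 even for repeats */
        size_t s = slot_for(&m, a[i]);
        m.keys[s] = a[i];
        m.counts[s] = 1;
    }
    for (size_t i = 0; i < nb; i++) {    /* second array: promote to 2 only if already present */
        size_t s = slot_for(&m, b[i]);
        if (m.counts[s] == 1) {
            m.counts[s] = 2;
            printf("%d ", b[i]);
        }
    }
    putchar('\n');
}

int main(void)
{
    int a[] = {1, 2, 2, 3, 4};
    int b[] = {2, 2, 4, 5};
    print_intersection(a, 5, b, 4);      /* prints: 2 4 */
}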
Define "best".
If you want to do it fast, you can do it in O(n) by iterating through each array and keeping a count for each unique element. Details of how to count the unique elements depend on the alphabet of things that can be in the array, e.g. is it sparse or dense?
Note that this is O(n) in the number of arrays, but O(nm) for n arrays of length m.
The best way is probably to hash all the values and keep a count of occurrences, culling all that have not occurred i times when you examine array i where i = {1, 2, ..., n}. Unfortunately, no deterministic algorithm can get you less than an O(n*m) running time, since it's impossible to do this without examining all the values in all the arrays if they're unsorted.
A faster algorithm would need either to accept some level of probability (Monte Carlo), or to rely on some known condition of the lists to examine only a subset of elements (i.e. you only care about elements that have occurred in all i-1 previous lists when considering the ith list, but in an unsorted list it is non-trivial to search for elements).