Complexity of sorting or not sorting an integer array

I have an array of integers storing some userIDs. I basically want to prevent a user from performing an action twice, so the moment they have done it their userID enters this array.
I wonder whether it is a good idea to keep this array sorted. If it is sorted, then you have A = {min, ..., max}. Then, if I'm not wrong, checking whether an ID is in the array takes log2(|A|) 'steps' (binary search). On the other hand, if the array is not sorted, you need |A|/2 steps on average.
So sorting seems better for checking whether an element exists (log(|A|) vs |A|), but what about adding a new value? Finding the position where the new userID should go can be done while you're checking, but then you have to displace all the elements from that position onwards by 1... or at least that's how I'd do it in C. The truth is this is going to be an array in a MongoDB document, so perhaps it is handled in some more efficient way.
Of course, if the array is unsorted, then adding a new value just takes one step (pushing it to the end).
To me, an add operation (with a check first) will take:
If sorted: log2(|A|) + |A|/2. The log2 part is to check and find the place, and the |A|/2 is the average number of displacements needed.
If not sorted: |A|/2 + 1. The |A|/2 is to check and the +1 is to push the new element.
Given that for adding you'll always check first, the unsorted version appears to take fewer steps, but truth be told I'm not very confident about the +|A|/2 of the sorted version. That's how I would do it in C, but maybe it can work another way...
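To make it concrete, this is roughly the C version I have in mind (a simplified sketch with a fixed capacity; all names are made up):

#include <stdio.h>
#include <string.h>

#define CAP 1024

/* Sorted insert: binary search for the slot, then shift the tail right. */
int insert_sorted(int *a, int n, int id) {
    int lo = 0, hi = n;
    while (lo < hi) {                  /* O(log n) to find the position */
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < id) lo = mid + 1;
        else hi = mid;
    }
    if (lo < n && a[lo] == id) return n;   /* already present */
    memmove(&a[lo + 1], &a[lo], (size_t)(n - lo) * sizeof a[0]); /* O(n) shift */
    a[lo] = id;
    return n + 1;
}

/* Unsorted insert: linear scan to check, then push at the end. */
int insert_unsorted(int *a, int n, int id) {
    for (int i = 0; i < n; i++)        /* O(n) membership check */
        if (a[i] == id) return n;
    a[n] = id;                         /* O(1) push */
    return n + 1;
}

int main(void) {
    int a[CAP], n = 0;                 /* assumes n never exceeds CAP */
    n = insert_sorted(a, n, 42);
    n = insert_sorted(a, n, 7);
    n = insert_sorted(a, n, 42);       /* duplicate, ignored */
    printf("n = %d, a[0] = %d\n", n, a[0]);  /* n = 2, a[0] = 7 */
    return 0;
}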

O(log |A|) is definitely better than O(|A|), but this can be done in O(1). The data structure you are looking for is a hash map (hash table), if you are going to do this in C. I haven't worked in C in a very long time, but as far as I know it still has no native hash table; C++ surely has one (std::unordered_map / std::unordered_set). There are also C libraries you can use in the worst case.
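If you do end up rolling it yourself in C, a minimal fixed-capacity hash set could look like this (a sketch only: linear probing, no resizing or deletion, and 0 reserved to mean "empty slot"; userIDs are assumed nonzero):

#include <stdio.h>

#define TABLE_SIZE 4099  /* a prime comfortably above the expected number of users */

static unsigned table[TABLE_SIZE];   /* 0 means "empty slot" */

/* Returns 1 if id was newly inserted, 0 if the user already acted. */
int check_and_insert(unsigned id) {
    unsigned i = id % TABLE_SIZE;
    while (table[i] != 0) {           /* linear probing on collisions */
        if (table[i] == id) return 0;
        i = (i + 1) % TABLE_SIZE;     /* assumes the table never fills up */
    }
    table[i] = id;                    /* O(1) on average */
    return 1;
}

int main(void) {
    printf("%d\n", check_and_insert(12345));  /* 1: first time */
    printf("%d\n", check_and_insert(12345));  /* 0: duplicate  */
    return 0;
}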
For MongoDB, my solution may not be the best, but I think you can create another collection of just the userIDs and index that collection on userID. This way, when someone tries to do the action, you can query the user's status as quickly as possible.
Also, in MongoDB you can try adding another key, say userDidTheAction, to your Users collection, whose value is true or false. Index the collection on userID and you will probably get performance similar to the other solution, at the cost of modifying your original collection's design (though schemas are not required to be fixed in MongoDB).

Related

Why is looking for an item in a hash map faster than looking for an item in an array?

You might have come across the claim that it is faster to find elements in a hashmap/dictionary/table than in a list/array. My question is: why?
(My inference so far: why should it be faster? As far as I can see, in both data structures it has to travel through the elements until it reaches the required one.)
Let’s reason by analogy. Suppose you want to find a specific shirt to put on in the morning. I assume that, in doing so, you don’t have to look at literally every item of clothing you have. Rather, you probably do something like checking a specific drawer in your dresser or a specific section of your closet and only look there. After all, you’re not (I hope) going to find your shirt in your sock drawer.
Hash tables are faster to search than lists because they employ a similar strategy - they organize data according to the principle that every item has a place it “should” be, then search for the item by just looking in that place. Contrast this with a list, where items are organized based on the order in which they were added and where there isn’t a particular pattern as to why each item is where it is.
More specifically: one common way to implement a hash table is with a strategy called chained hashing. The idea goes something like this: we maintain an array of buckets. We then come up with a rule that assigns each object a bucket number. When we add something to the table, we determine which bucket number it should go to, then jump to that bucket and then put the item there. To search for an item, we determine the bucket number, then jump there and only look at the items in that bucket. Assuming that the strategy we use to distribute items ends up distributing the items more or less evenly across the buckets, this means that we won’t have to look at most of the items in the hash table when doing a search, which is why the hash table tends to be much faster to search than a list.
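As a rough sketch of that idea in C (the bucket count and the modulo rule are arbitrary, illustrative choices):

#include <stdio.h>
#include <stdlib.h>

#define NUM_BUCKETS 101               /* illustrative bucket count */

struct node { int key; struct node *next; };
static struct node *buckets[NUM_BUCKETS];

static unsigned bucket_of(int key) {  /* the "rule" that assigns a bucket number */
    return (unsigned)key % NUM_BUCKETS;
}

void insert(int key) {
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = buckets[bucket_of(key)];  /* prepend to that bucket's chain */
    buckets[bucket_of(key)] = n;
}

int contains(int key) {
    for (struct node *n = buckets[bucket_of(key)]; n; n = n->next)
        if (n->key == key) return 1;  /* only this one chain is scanned */
    return 0;
}

int main(void) {
    insert(42);
    printf("%d %d\n", contains(42), contains(7));  /* prints: 1 0 */
    return 0;
}

Assuming the keys spread evenly, each chain stays short, so a search touches only a few items no matter how large the table grows.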
For more details on this, check out these lecture slides on hash tables, which fill in more of the details about how this is done.
Hope this helps!
To understand this, think about how the elements are stored in these data structures.
A HashMap/Dictionary, as you know, is a key-value data structure. To store an element, you first compute the hash value of the key (a hash function maps a key to a bucket index; for example, a simple hash function can be built with the modulo operation). Then you put the value in the slot for this hashed key.
In a List (a linked list here), you basically keep appending elements to the end. The order of insertion matters in this data structure, and the memory allocated to it is not contiguous.
An Array is similar to a List, but the memory allocated to it is contiguous. So, if you know the address of the first index, you can compute the address of the nth element.
Now think about retrieving an element from each of these data structures:
From a HashMap/Dictionary: when you search for an element, the first thing you do is compute the hash value for the key. Once you have that, you go to the slot for the hashed value and obtain the value. The amount of work performed is (on average) constant; in asymptotic notation, this is O(1).
From a List: you literally need to iterate through each element and check whether it is the one you are looking for. In the worst case, your desired element is at the end of the list, so the amount of work varies and you may have to iterate the whole list. In asymptotic notation, this is O(n), where n is the number of elements in the list.
From an Array: to find an element, all you need is the address of the first element. For any other element, you can do the math of how far it sits from the first index.
For example, let's say the address of the first element is 100 and each element takes 4 bytes of memory. The element you are looking for is at the 3rd position. Then its address is:
address of first element + (position - 1) * size of each element
that is, 100 + (3 - 1) * 4 = 108.
As you can observe, here too the work performed to find an element is constant; in asymptotic notation, this is O(1).
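You can see that address arithmetic directly in C (a tiny demo; the final byte offset of 8 assumes a 4-byte int):

#include <stdio.h>

int main(void) {
    int a[5] = {10, 20, 30, 40, 50};
    /* &a[2] is computed as base address + 2 * sizeof(int): no scanning involved */
    printf("base: %p  a[2]: %p  value: %d\n", (void *)a, (void *)&a[2], a[2]);
    printf("offset in bytes: %td\n", (char *)&a[2] - (char *)a);  /* 8 if sizeof(int) == 4 */
    return 0;
}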
Now, to compare: O(1) is faster than O(n) for any reasonably large n, and hence retrieval of elements from a HashMap/Dictionary or an array will generally be faster than from a List.
I hope this helps.

Is there an algorithm that puts elements with equal keys in groups faster than sorting the elements?

Some elements with integer keys are in an array. I want the elements with equal keys to be in groups inside the array. This can be accomplished by sorting the elements; however, it does not matter to me whether the elements are sorted, only that they are in groups of equal keys. Is there a way to accomplish this that is faster than sorting?
A hash map should work well on average. Use a "count" for the value, which gets incremented each time you see the corresponding key in the array, and then use those counts to overwrite your array.
That said, calling "sort" is still pretty fast and easier to read. A good quicksort can actually avoid some work when duplicates exist, so you should really run some benchmarks to be sure that an uglier approach is fast enough to be worthwhile.
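For scale, the sort-based baseline is only a few lines in C (a sketch using the standard qsort; a hash-based version would replace the sort with a counting pass over a table):

#include <stdio.h>
#include <stdlib.h>

/* qsort comparator: ascending by integer key */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int keys[] = {3, 1, 3, 2, 1, 3};
    size_t n = sizeof keys / sizeof keys[0];

    qsort(keys, n, sizeof keys[0], cmp_int);  /* sorting trivially groups equal keys */

    for (size_t i = 0; i < n; i++)
        printf("%d ", keys[i]);               /* prints: 1 1 2 3 3 3 */
    printf("\n");
    return 0;
}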

What C construct would allow me to 'reverse reference' an array?

I'm looking for an elegant way (or a construct with which I am unfamiliar) to do the equivalent of 'reverse referencing' an array. That is, say I have an integer array
handle[number] = nameNumber
Sometimes I know the number and need the nameNumber, but sometimes I only know the nameNumber and need the matching [number] in the array.
The integer nameNumber values are each unique - that is, no two nameNumbers are the same - so every [number] and nameNumber pair is also unique.
Is there a good way to 'reverse reference' an array value (or some other construct) without having to sweep the entire array looking for the matching value, (or having to update and keep track of two different arrays with reverse value sets)?
If the array is sorted and you know its length, you could binary search for the element. This is an O(log n) search instead of an O(n) scan through the array. Divide the array in half and check whether the element at the center is greater or less than what you're looking for, keep the half your element must be in, and divide in half again. Each comparison eliminates half of the remaining elements. Keep this process going and you'll eventually land on the element you're looking for.
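If this is C, the standard library already ships that search. A sketch using bsearch, assuming handle[] is kept sorted by nameNumber (the values here are made up):

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* handle[number] = nameNumber, kept sorted by nameNumber */
    int handle[] = {11, 23, 42, 57, 90};
    size_t n = sizeof handle / sizeof handle[0];

    int wanted = 42;
    int *hit = bsearch(&wanted, handle, n, sizeof handle[0], cmp_int);
    if (hit)
        printf("nameNumber %d is at number %td\n", wanted, hit - handle);
    return 0;
}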
I don't know whether it's acceptable for you to use C++ and the Boost libraries. If yes, you can use boost::bimap<X, Y>.
Boost.Bimap is a bidirectional maps library for C++. With Boost.Bimap you can create associative containers in which both types can be used as a key. A bimap can be thought of as a combination of a std::map<X, Y> and a std::map<Y, X>.

Array vs Dictionary search performance in Swift

I think it's probably a simple answer but I thought I'd quickly check...
Let's say I'm adding Ints to an array at various points in my code, and then later on I want to find out if the array contains a certain Int...
var array = [Int]()
array.append(2)
array.append(4)
array.append(5)
array.append(7)
if array.contains(7) { print("There's a 7 alright") }
Is this heavier performance-wise than if I created a dictionary?
var dictionary = [Int:Int]()
dictionary[7] = 7
if dictionary[7] != nil { print("There's a value for key 7")}
Obviously there are reasons like wanting to eliminate the possibility of duplicate entries of the same number... but I could also do that with a Set. I'm mainly just wondering about the performance of dictionary[key] vs array.contains(value).
Thanks for your time
Generally speaking, dictionaries provide constant-time, i.e. O(1), access on average, which means checking whether a value exists and updating it are faster than with an Array, where a search by value is O(n). If those are the operations you need to optimize for, then a Dictionary is a good choice. However, since dictionaries enforce uniqueness of keys, you cannot insert multiple values under the same key.
Based on the question, I would recommend for you to read Ray Wenderlich's Collection Data Structures to get a more holistic understanding of data structures than I can provide here.
I did some sampling!
I edited your code so that the print statements are empty.
I ran the code 1,000,000 times. Every time I measured how long it takes to access the dictionary and the array separately. Then I subtracted arrTime from dictTime (dictTime - arrTime) and saved this number each time.
Once it finished I took the average of the results.
The result is 23150, meaning that over 1,000,000 tries the array was faster to access by 23,150 nanoseconds on average.
The max difference was 2,426,737 ns and the min was -5,711,121 ns.
Here are the results on a graph: [graph image not included]

how to write order preserving minimal perfect hash for integer keys?

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want to have a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer, but there won't be more than 2 million in total. The number will vary, as keys (and corresponding array entries) will be deleted, but new keys will always be numbered higher than the previous highest-numbered key.
In the above example, if key 69 were deleted, then hashing 3493 should return 1 (rather than 2), as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take on the order of low hundreds of nanoseconds, though I expect deletion to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and compile with gcc using pure C.
Actually, I don't know if I understand exactly what you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, binary search works in O(log N) time, which is very fast.
If you delete an element from the list of "keys", binary search still works, without extra effort or space (however, removing one element from the list naturally forces you to move all the elements to the right of the deleted one).
You only have to provide three things to the binary search algorithm: the array, the size of the array, and the desired key, of course.
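A sketch of that in pure C, using the keys from the question (the returned index is exactly the 0-based value the question asks for, and it stays correct after a deletion because the remaining elements are shifted left):

#include <stdio.h>

/* Returns the index of key in the sorted array a[0..n-1], or -1 if absent. */
long rank_of(const unsigned *a, long n, unsigned key) {
    long lo = 0, hi = n - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return mid;
        if (a[mid] < key)  lo = mid + 1;
        else               hi = mid - 1;
    }
    return -1;
}

int main(void) {
    unsigned keys[] = {56, 69, 3493, 49956, 345678, 345679};
    printf("%ld\n", rank_of(keys, 6, 3493));  /* 2; becomes 1 once 69 is removed */
    return 0;
}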
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key array, and its index in the key array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
You don't need an order-preserving minimal perfect hash, because any old hash would do. You don't want to use a 4 GB array, but with 2 million items you wouldn't mind using 3 million lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected data size. For example, if you expect around 2 million items, choose a prime around 3 million.
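A sketch of that hash in C (the prime below is illustrative only - verify primality, or pick any known prime sitting a bit above your expected key count):

#include <stdio.h>

#define HASH_PRIME 3000017u   /* illustrative; meant to sit a bit above ~2M keys */

static unsigned hash_key(unsigned key) {
    return key % HASH_PRIME;  /* bucket index into the hash map's table */
}

int main(void) {
    printf("%u\n", hash_key(345679u));
    return 0;
}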
