Why does having an index actually speed up look-up time? - arrays

I've always wondered about why this is the case.
For instance, say I want to find the number 5 located in an array of numbers. I have to compare my desired number against every single value to find what I'm looking for.
This is clearly O(N).
But, say for instance, I have an index that I know contains my desired item. I can just jump right to it, right? And this is also the case with Maps that are hashed: as I provide a key to look up, the same hash function that determined the key's index position is run on it again, so this also allows me to jump right to the correct index.
But my question is why is that any different than the O(N) lookup time for finding a value in an array through direct comparison?
As far as a naive computer is concerned, shouldn't an index be the same as looking for a value? Shouldn't the raw operation still be, as I traverse the structure, I must compare the current index value to the one I know I'm looking for?
It makes a great deal of sense why something like binary search can achieve O(log N), but I still can't intuitively grasp why certain things can be O(1).
What am I missing in my thinking?

Arrays are usually stored as a large block of memory.
If you're looking up an index, this lets you calculate the offset of that index within the block of memory in O(1).
Say the array starts at memory address 124 and each element is 10 bytes large; then you know the element at index 5 is at address 124 + 10*5 = 174.
Binary search will actually (usually) do something similar (since by-index lookup is just O(1) for an array): you start off in the middle, doing a by-index lookup to get that element. Then you look at the element at either the 1/4 or 3/4 position, which again needs a by-index lookup.
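To make that address arithmetic concrete, here is a minimal C++ sketch of the same offset calculation the compiler performs for arr[5] (the values are illustrative):

#include <cstdint>
#include <iostream>

int main() {
    int arr[8] = {10, 20, 30, 40, 50, 60, 70, 80};

    // arr[5] compiles down to exactly this arithmetic:
    // base address + index * sizeof(element). No traversal happens.
    std::uintptr_t base = reinterpret_cast<std::uintptr_t>(arr);
    int* fifth = reinterpret_cast<int*>(base + 5 * sizeof(int));

    std::cout << *fifth << '\n';  // prints 60
    std::cout << arr[5] << '\n';  // same element, same address
}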

A HashMap has an array underneath it. When a key/value pair is added to the map, the key's hashCode() is evaluated and normalized so that the value can be placed at its particular index in the array. When two keys' codes normalize to the same index of the map, the entries are appended to a LinkedList.
When you perform a look-up, the key you are looking up has its hashCode() evaluated and normalized to give the index to search. The map then traverses the linked list at that index until it finds the key, and returns the associated value.
This look-up time is, in the best case, the same as looking up array[i], which is O(1).
The reason it is a speed up is because you don't actually have to traverse your structure to look something up, you just jump right to the place where you expect it to be.
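As a rough illustration of that jump-then-scan-one-bucket behavior, here is a stripped-down chained hash table in C++ (the fixed bucket count and int keys are simplifications; a real HashMap also resizes and handles generic types):

#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <utility>
#include <vector>

struct IntMap {
    std::vector<std::list<std::pair<int, int>>> buckets;

    IntMap() : buckets(16) {}  // 16 chains; a real map grows this as it fills

    // Hash the key and normalize it to a bucket index.
    std::size_t index(int key) const {
        return std::hash<int>{}(key) % buckets.size();
    }

    void put(int key, int value) {
        buckets[index(key)].push_back({key, value});
    }

    std::optional<int> get(int key) const {
        // Only this one bucket is scanned; the rest are never touched.
        for (const auto& [k, v] : buckets[index(key)])
            if (k == key) return v;
        return std::nullopt;
    }
};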

Related

Why is looking for an item in a hash map faster than looking for an item in an array?

You might have come across someplace where it is mentioned that it is faster to find elements in a hashmap/dictionary/table than in a list/array. My question is WHY?
(My inference so far: why should it be faster? As far as I can see, in both data structures it has to travel through the elements until it reaches the required one.)
Let’s reason by analogy. Suppose you want to find a specific shirt to put on in the morning. I assume that, in doing so, you don’t have to look at literally every item of clothing you have. Rather, you probably do something like checking a specific drawer in your dresser or a specific section of your closet and only look there. After all, you’re not (I hope) going to find your shirt in your sock drawer.
Hash tables are faster to search than lists because they employ a similar strategy - they organize data according to the principle that every item has a place it “should” be, then search for the item by just looking in that place. Contrast this with a list, where items are organized based on the order in which they were added and where there isn't a particular pattern as to why each item is where it is.
More specifically: one common way to implement a hash table is with a strategy called chained hashing. The idea goes something like this: we maintain an array of buckets. We then come up with a rule that assigns each object a bucket number. When we add something to the table, we determine which bucket number it should go to, then jump to that bucket and put the item there. To search for an item, we determine the bucket number, then jump there and only look at the items in that bucket. Assuming that the strategy we use to distribute items ends up distributing them more or less evenly across the buckets, this means that we won't have to look at most of the items in the hash table when doing a search, which is why the hash table tends to be much faster to search than a list.
For more details on this, check out these lecture slides on hash tables, which fill in more of the details about how this is done.
Hope this helps!
To understand this, you can think of how the elements are stored in these data structures.
HashMap/Dictionary, as you know, is a key-value data structure. To store an element, you first compute the hash of its key (a function that maps a key to a position; for example, a simple hash function can be made with the modulo operation). Then you basically put the value at the position given by this hashed key.
In a List, you basically keep appending elements to the end. The order of insertion matters in this data structure, and the memory allocated to it is not contiguous.
An Array is similar to a List, but in this case the memory allocated is contiguous. So, if you know the address of the first index, you can find the address of the nth element.
Now think of the retrieval of the element from these Data structures:
From HashMap/Dictionary: when you search for an element, the first thing you do is compute the hash value for the key. Once you have that, you go straight to the hashed position in the map and obtain the value. The amount of work performed is always constant; in asymptotic notation, this is O(1).
From List: you literally need to iterate through each element and check whether it is the one you are looking for. In the worst case, your desired element is at the end of the list, so the amount of work varies and you might have to iterate the whole list. In asymptotic notation, this is O(n), where n is the number of elements in the list.
From Array: to find an element in the array, all you need to know is the address of the first element. For any other element, you can do the math of how far it sits from the first index.
For example, let's say the address of the first element is 100 and each element takes 4 bytes of memory. The element you are looking for is at the 3rd position, so you know its address is 108. The math used is:
address of first element + (position of element - 1) * memory used per element
That is, 100 + (3 - 1)*4 = 108.
In this case too, as you can observe, the work performed to find an element is constant. In asymptotic notation, this is O(1).
Now to compare: O(1) will always be faster than O(n). Hence retrieval of elements from a HashMap/Dictionary or an array will always be faster than from a List.
I hope this helps.

What C construct would allow me to 'reverse reference' an array?

Looking for an elegant way (or a construct with which I am unfamiliar) that allows me to do the equivalent of 'reverse referencing' an array. That is, say I have an integer array
handle[number] = nameNumber
Sometimes I know the number and need the nameNumber, but sometimes I only know the nameNumber and need the matching [number] in the array.
The integer nameNumber values are each unique, that is, no two nameNumbers are the same, so every [number] and nameNumber pair is also unique.
Is there a good way to 'reverse reference' an array value (or some other construct) without having to sweep the entire array looking for the matching value, (or having to update and keep track of two different arrays with reverse value sets)?
If the array is sorted and you know its length, you could binary search for the element. This is an O(log n) search instead of an O(n) sweep through the array. Divide the array in half and check whether the element at the center is greater or less than what you're looking for, take the half your element is in, and divide in half again. Each decision eliminates half of the remaining elements. Keep this process going and you'll eventually land on the element you're looking for.
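Here is a sketch of that search. It assumes handle[] is sorted in ascending nameNumber order, and returns the matching [number] index for a given nameNumber, or -1 if it is absent. (Written so that it is valid in both C and C++.)

#include <stddef.h>

/* Binary search: returns the index ("number") whose stored nameNumber
   equals target, or -1 if no such entry exists.
   Precondition: handle[] is sorted in ascending order. */
int reverse_reference(const int handle[], size_t len, int target)
{
    size_t lo = 0, hi = len;            /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (handle[mid] < target)
            lo = mid + 1;               /* discard the lower half */
        else if (handle[mid] > target)
            hi = mid;                   /* discard the upper half */
        else
            return (int)mid;            /* found: this is the [number] */
    }
    return -1;                          /* nameNumber not present */
}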
I don't know whether it's acceptable for you to use C++ and the Boost libraries. If yes, you can use boost::bimap<X, Y>.
Boost.Bimap is a bidirectional maps library for C++. With Boost.Bimap you can create associative containers in which both types can be used as a key. A bimap<X, Y> can be thought of as a combination of a std::map<X, Y> and a std::map<Y, X>.
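A short usage sketch (the numbers are made up; note that insertion goes through the bimap's value_type, which pairs the two keys):

#include <boost/bimap.hpp>
#include <iostream>

int main() {
    typedef boost::bimap<int, int> Lookup;  // left = number, right = nameNumber
    Lookup handle;

    handle.insert(Lookup::value_type(0, 4711));  // handle[0] = 4711
    handle.insert(Lookup::value_type(1, 1234));  // handle[1] = 1234

    std::cout << handle.left.at(0) << '\n';      // number -> nameNumber: 4711
    std::cout << handle.right.at(4711) << '\n';  // nameNumber -> number: 0
}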

how to write order preserving minimal perfect hash for integer keys?

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want to have a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer, but there won't be more than 2 million in total. The number will vary as keys (and corresponding array entries) are deleted, but new keys will always be higher numbered than the previous highest numbered key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take in the low hundreds of nanoseconds, though deletion I expect to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux, compiled with gcc, using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, then binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", the Binary Search Algorithm still works, without extra effort or space (although the operation of removing an element from the list naturally forces you to move all the elements to the right of the deleted one).
You only have to provide three things to the Binary Search Algorithm: the array, the size of the array, and the desired key, of course.
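In code, the rank returned by binary search is exactly the order-preserving index the question asks for. A sketch in C++ (the question wants pure C, but the same logic is only a few more lines there):

#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the ordinal position of key among the stored keys, or -1 if
// absent. Keys are kept sorted, so the index found by binary search is
// itself the order-preserving "hash": deleting 69 from {56, 69, 3493, ...}
// automatically makes 3493 map to 1 instead of 2.
int rank_of(const std::vector<std::uint32_t>& keys, std::uint32_t key) {
    auto it = std::lower_bound(keys.begin(), keys.end(), key);
    if (it == keys.end() || *it != key) return -1;
    return static_cast<int>(it - keys.begin());
}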
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key array; its index in the key array is then the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
You don't need an order preserving minimal perfect hash, because any old hash would do. You don't want to use a 4 GB array, but with 2M items you wouldn't mind a lookup table of 3M entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime, with the prime chosen to be a bit larger than your expected data. For example, if you expect 2M items, then choose a prime around 3M.
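A sketch of that choice (the bucket count below is just an example near 3M; verify its primality, or substitute a known prime, before relying on it):

#include <cstdint>

// Bucket count: a prime somewhat above the expected 2M keys.
// 3000017 is used for illustration; check primality before real use.
const std::uint32_t kBuckets = 3000017;

std::uint32_t bucket_of(std::uint32_t key) {
    return key % kBuckets;  // remainder-by-prime spreads the integer keys
}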

What data structure to use here

Hashes provide an excellent mechanism to extract values corresponding to a given key in almost O(1) time, but they never preserve the order in which the keys were inserted. So is there any data structure that can simulate the best of both array and hash, that is, return the value corresponding to a given key in O(1) time, as well as return the nth value inserted in O(1) time? The ordering should be maintained, i.e., if the hash is {a:1, b:2, c:3} and something like del hash[b] has been done, nth(2) should return {c,3}.
Examples:
hash = {};
hash[a] = 1;
hash[b] = 2;
hash[c] = 3;
nth(2); //should return 2
hash[d] = 4;
del hash[c];
nth(3); //should return 4, as 'd' has been shifted up
Using modules like TIE::Hash or similar stuff won't do, the onus is on me to develop it from scratch!
It depends on how much memory may be allocated for this data structure. For O(N) space there are several choices:
It's easy to get a data structure with O(1) time for each of these operations: "get value by key", "get nth value inserted", "insert" - but only with O(N) "delete" time. Just use a combination of a hash map and an array, as explained by ppeterka.
Less obvious, but still simple is O(sqrt N) for "delete" and O(1) for all other operations.
A little bit more complicated is to "delete" in O(N^(1/4)), O(N^(1/6)), or, in the general case, in O(M*N^(1/M)) time.
It is most likely impossible to decrease "delete" time to O(log N) while retaining O(1) for the other operations. But it is possible if you agree to O(log N) time for every operation; solutions based on a binary search tree or a skip list allow it. One option is an order statistics tree: augment every node of a binary search tree with a counter storing the number of elements in the sub-tree under that node, then use it to find the nth node. Another option is an indexable skiplist. One more option is to use the O(M*N^(1/M)) solution with M = log(N).
And I don't think you can get O(1) "delete" without increasing time for other operations even more.
If unlimited space is available, you can do every operation in O(1) time.
O(sqrt N) "delete"
You can use a combination of two data structures to find a value by key and to find a value by insertion order. The first is a hash map (mapping each key to both its value and its position in the other structure). The second is a tiered vector, which maps a position to both value and key.
A tiered vector is a relatively simple data structure that can easily be developed from scratch. The main idea is to split the array into sqrt(N) smaller arrays, each of size sqrt(N). Each small array needs only O(sqrt N) time to shift values after a deletion. And since each small array is implemented as a circular buffer, two small arrays can exchange a single element in O(1) time, which allows the "delete" operation to complete in O(sqrt N) time (one such exchange for each sub-array between the deleted value and the first/last sub-array). A tiered vector also allows insertion into the middle in O(sqrt N), but this problem does not require it, so we can just append new elements at the end in O(1) time. To access an element by position, we determine the starting position of the circular buffer for the sub-array where the element is stored, then get the element from that circular buffer; this also takes O(1) time.
Since the hash map remembers a position in the tiered vector for each of its keys, it must be updated whenever an element in the tiered vector changes position (O(sqrt N) hash map updates per "delete").
O(M*N^(1/M)) "delete"
To optimize "delete" operation even more, you can use approach, described in this answer. It deletes elements lazily and uses a trie to adjust element's position, taking into account deleted elements.
O(1) for every operation
You can use a combination of three data structures to do this. The first is a hash map (mapping each key to both its value and its position in the array). The second is an array, which maps a position to both value and key. And the third is a bit set, with one bit for each element of the array.
"Insert" operation just adds one more element to the array's end and inserts it into hash map.
"Delete" operation just unsets corresponding bit in the bit set (which is initialized with every bit = 1). Also it deletes corresponding entry from hash map. (It does not move elements of array or bit set). If, after "delete" the bit set has more than some constant proportion of elements deleted (like 10%), the whole data structure should be re-created from scratch (this allows O(1) amortized time).
"Find by key" is trivial, only hash map is used here.
"Find by position" requires some pre-processing. Prepare a 2D array. One index is the position we search. Other index is current state of our data structure, the bit set, reinterpreted as an index. Calculate population count for each prefix of every possible bit set and store prefix length, indexed by both population count and the bit set itself. Having this 2D array ready, you can perform this operation by first indexing by position and current "state" in this 2D array, then by indexing in the array with values.
Time complexity for every operation is O(1) (for insert/delete it is O(1) amortized). Space complexity is O(N * 2^N).
In practice, using the whole bit set to index an array limits the allowed value of N to the pointer size (usually 64), and it is limited even more by available memory. To alleviate this, we can split both the array and the bit set into sub-arrays of size N/C, where C is some constant. Now we can use a smaller 2D array to find the nth element in each sub-array. And to find the nth element in the whole structure, we need an additional structure recording the number of valid elements in each sub-array. This is a structure of constant size C, so every operation on it is also O(1). It may be implemented as an array, but it is better to use some logarithmic-time structure like an indexable skiplist. After this modification, time complexity for every operation is still O(1); space complexity is O(N * 2^(N/C)).
Now that the question is clear to me too (better late than never...), here are my proposals:
You could maintain two hashes: one with the keys, and one with the insert order. This, however, is very ugly and slow to maintain when deleting and inserting in between. It would give the same almost-O(1) time for accessing the elements both ways.
You could use a hash for the keys and maintain an array for the insert order (sketched below). This is a lot nicer than the two-hash approach; deleting is still not very fast, but I think still a lot quicker. This also gives true O(1) access to the nth element.
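A minimal sketch of that hash + array combination, assuming string keys and int values: get-by-key and nth are O(1), insert is amortized O(1), and delete is the O(N) weak spot because the array must be swept and shifted.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct OrderedMap {
    std::unordered_map<std::string, int> byKey;        // key -> value
    std::vector<std::pair<std::string, int>> byOrder;  // insertion order

    void insert(const std::string& key, int value) {
        byKey[key] = value;
        byOrder.push_back({key, value});
    }
    int get(const std::string& key) const { return byKey.at(key); }  // ~O(1)
    std::pair<std::string, int> nth(std::size_t n) const {           // O(1)
        return byOrder.at(n - 1);  // 1-based, like nth(2) in the question
    }
    void erase(const std::string& key) {  // O(N): sweep and shift the array
        byKey.erase(key);
        for (std::size_t i = 0; i < byOrder.size(); ++i)
            if (byOrder[i].first == key) {
                byOrder.erase(byOrder.begin() + i);
                break;
            }
    }
};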
At first, I misunderstood the question, and gave a solution that gives O(1) key lookup, and O(n) lookup of nth element:
In Java, there is the LinkedHashMap for this particular task.
I think however that if someone finds this page, this might not be totally useless, so I leave it here...
There is no data structure with O(1) for everything you cited. In particular, any data structure with random dynamic insertion/deletion in the middle AND sorted/indexed access cannot have maintenance time lower than O(log N): to maintain such a dynamic collection, you have to rely either on the "less than" operator (binary, thus O(log2 N)) or on some computed organization (typically O(sqrt N), using sqrt(N) sub-arrays). Note that O(sqrt N) > O(log N).
So, no.
You might reach O(1) for everything, including keeping order, with the linked list + hash map combination sketched below; and if access is mostly sequential, you could cache nth(x) so as to access nth(x +/- 1) in O(1).
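A sketch of that linked list + hash map combination (the pattern behind Java's LinkedHashMap): the map stores list iterators, so key lookup, insert, and delete are all ~O(1), while nth(x) still costs a walk unless you cache a cursor as suggested above.

#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

struct LinkedMap {
    using Node = std::pair<std::string, int>;
    std::list<Node> order;  // keeps insertion order
    std::unordered_map<std::string, std::list<Node>::iterator> index;

    void put(const std::string& k, int v) {
        order.push_back({k, v});
        index[k] = std::prev(order.end());
    }
    int get(const std::string& k) const { return index.at(k)->second; }  // ~O(1)
    void erase(const std::string& k) {  // ~O(1): unlink the node, no shifting
        order.erase(index.at(k));
        index.erase(k);
    }
    // nth(x) is the missing piece: it still requires walking the list.
};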
I guess only a plain array will give you O(1); otherwise the best variant is to look for a solution that gives O(n) in the worst scenario. You can also use a really, really bad approach: using the key as an index into a plain array. I guess there is a way to transform any key into an index in a plain array.
std::string memoryMap[0x10000]; // one slot for every possible 16-bit key
int key = 100;                  // the key itself is the array index
std::string value = "Hello, World!";
memoryMap[key] = value;         // O(1) store; lookup is just memoryMap[key]

Inserting a number into a sorted array!

I would like to write a piece of code for inserting a number into a sorted array at the appropriate position (i.e. the array should still remain sorted after insertion)
My data structure doesn't allow duplicates.
I am planning to do something like this:
Find the right index where I should be putting this element using binary search
Create space for this element, by moving all the elements from that index down.
Put this element there.
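In code, the plan above looks something like this: a sketch using std::lower_bound for step 1 and vector::insert for steps 2 and 3 (the duplicate check covers the no-duplicates requirement):

#include <algorithm>
#include <vector>

// Inserts value into the sorted vector, keeping it sorted.
// Returns false without inserting if the value is already present.
bool insert_sorted(std::vector<int>& v, int value) {
    auto pos = std::lower_bound(v.begin(), v.end(), value);  // step 1: O(log n)
    if (pos != v.end() && *pos == value)
        return false;                                        // duplicate: rejected
    v.insert(pos, value);                                    // steps 2+3: shift, O(n)
    return true;
}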
Is there any other better way?
If you really have an array and not a better data structure, that's optimal. If you're flexible on the implementation, take a look at AA Trees - they're rather fast and easy to implement. Obviously, a tree takes more space than an array, and it's not worth it if the number of elements is not big enough for the slowness of the blit to be noticeable compared to the pointer magic.
Does the data have to be completely sorted all the time?
If not, and it is only necessary to access the smallest or highest element quickly, a Binary Heap gives constant access time and O(log n) insertion and deletion time.
Moreover, it can satisfy your condition that the memory be consecutive, since you can implement a binary heap on top of an array (i.e., array[2n+1] is the left child and array[2n+2] the right child).
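A sketch of that array-backed heap using the standard heap algorithms (std::greater makes it a min-heap, so the smallest element is always at index 0):

#include <algorithm>
#include <functional>
#include <vector>

int main() {
    std::vector<int> heap = {42, 7, 19};
    std::make_heap(heap.begin(), heap.end(), std::greater<int>());  // min-heap, in place

    heap.push_back(3);                                              // O(log n) insertion
    std::push_heap(heap.begin(), heap.end(), std::greater<int>());

    int smallest = heap.front();                                    // O(1) access: 3

    std::pop_heap(heap.begin(), heap.end(), std::greater<int>());   // O(log n) deletion
    heap.pop_back();
    (void)smallest;
}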
A heap-based implementation of a tree would be more efficient if you are inserting a lot of elements: O(log n) for both locating/removing and inserting operations.
