Do maps with char-type keys have faster access time than normal arrays?
The reason I think this is true is because normal arrays have integer-type indexing while the maps I think about have char-type indexing.
Integers are 4 bytes while chars are only 1 byte, so it seems reasonable to believe that accessing a map item at a given char key is faster than accessing a normal array item at a given integer index. In other words, the CPU has fewer bytes of the index/key value to examine to determine which element in the array is being referred to in memory.
Maps are slower than arrays.
A map is itself implemented on top of arrays, so every lookup does extra work before it ever touches the underlying storage.
For larger amounts of data, though, a HashMap can still pay off, since (if used correctly) it lets you avoid long runs of comparisons.
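To see why the char/int distinction in the question does not help, here is a minimal C sketch (names are illustrative): indexing a plain array with a char key is still ordinary base-plus-offset arithmetic, because the char is promoted to an integer before indexing, while a map lookup adds hashing or comparisons on top of that.

#include <stdio.h>

int main(void)
{
    int counts[256] = {0};          /* a plain array indexed 0..255 */
    char key = 'x';

    /* The char key is promoted to an integer before indexing, so this is
       the same base + offset arithmetic as counts[120]; the 1-byte key
       gives no speed advantage over an int index. */
    counts[(unsigned char)key]++;

    printf("count for 'x': %d\n", counts[(unsigned char)'x']);
    return 0;
}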
I have a series of fixed-size arrays of binary values (individuals from a genetic algorithm) that I would like to associate with a floating-point value (fitness value). Such a lookup table would have a fairly large size, constrained by available memory. Given the nature of the keys, is there a hash function that would guarantee no collisions? I tried a few things but they result in collisions. What other data structure could I use to build this lookup system?
To answer your questions:
There is no hash function that guarantees no collisions unless you make a hash function that encodes completely the bit array, meaning that given the hash you can reconstruct the bit array. This type of function would be a compression function. If your arrays have a lot of redundant information (for example most of the values are zeros), compressing them could be useful to reduce the total size of the lookup table.
A question on compressing a bit array in C is answered here: Compressing a sparse bit array
Since you have most of the bits set to zero, the easiest solution would be to just write a function that converts your bit array into an integer array that keeps track of the positions of the bits that are set to '1'. Then write a function that does the opposite if you need the bit array again. You can save only the encoded array in the hashmap.
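A minimal C sketch of that encoding, with hypothetical helper names, assuming the caller allocates the positions buffer and tracks the lengths:

#include <string.h>

/* Encode a bit array as the list of positions of its set bits.
   Returns the number of set bits written to positions[]. */
size_t encode_set_bits(const unsigned char *bits, size_t nbits,
                       unsigned int *positions)
{
    size_t count = 0;
    for (size_t i = 0; i < nbits; i++) {
        if (bits[i / 8] & (1u << (i % 8)))
            positions[count++] = (unsigned int)i;
    }
    return count;
}

/* Rebuild the original bit array from the encoded positions. */
void decode_set_bits(const unsigned int *positions, size_t count,
                     unsigned char *bits, size_t nbits)
{
    memset(bits, 0, (nbits + 7) / 8);
    for (size_t i = 0; i < count; i++)
        bits[positions[i] / 8] |= (unsigned char)(1u << (positions[i] % 8));
}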
Another option to reduce the total size of the lookup table is to erase the old values. Since you are using a genetic algorithm, the population changes over time and old values become useless, so you could periodically remove the older entries from the lookup table.
In Skiena's Algorithm Design Manual, he mentions at one point:
The primary thing lost using dynamic arrays is the guarantee that each array access takes constant time in the worst case. Now all the queries will be fast, except for those relatively few queries triggering array doubling. What we get instead is a promise that the nth array access will be completed quickly enough that the total effort expended so far will still be O(n).
I'm struggling to understand this. How will an array query expand the array?
Dynamic arrays are arrays where the size does not need to be specified up front (think of an ArrayList in Java). Under the hood, dynamic arrays are implemented using a regular array, and since a regular array has a fixed size, the ArrayList implementation still has to pick a size for that underlying array.
The typical way to handle this is to initialize the underlying array with some initial capacity; when it fills up, a new array of double the size is allocated and the existing elements are copied over.
Because of this underlying functionality, most of the time it will take constant time when adding to a dynamic array, but occasionally it will double the size of the 'under the hood' standard array which will take longer than the normal add time.
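For illustration, here is a minimal sketch of such a doubling append, written in C for concreteness (hypothetical names, error handling omitted):

#include <stdlib.h>

/* A growable int array that doubles its capacity when full. */
typedef struct {
    int *data;
    size_t size;       /* elements currently stored */
    size_t capacity;   /* elements allocated */
} dynarray;

void dynarray_push(dynarray *a, int value)
{
    if (a->size == a->capacity) {
        /* The occasional expensive step: double the underlying array. */
        size_t new_capacity = a->capacity ? a->capacity * 2 : 8;
        a->data = realloc(a->data, new_capacity * sizeof *a->data);
        a->capacity = new_capacity;
    }
    a->data[a->size++] = value;   /* the common constant-time append */
}

Most calls hit only the last line; each doubling copies all existing elements, which is what breaks the per-operation constant-time guarantee while keeping the total cost O(n).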
If your confusion lies with his use of the word 'query', I believe he means to say 'adding or removing from the array' because a simple 'get' query shouldn't be related to the underlying standard array size.
I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, I need to return some other data structure. My question is: what data structure should I use to hold this efficiently?
I would have used an array, giving me O(1) lookup, if the numbers were in the range of, say, 1 to 5000. But when the input number gets large, it becomes unrealistic to use an array, as the memory requirements would be huge.
I am hence trying to look at a data structure that yields the result fast and is not very heavy.
Any clues anybody?
EDIT:
It would not make sense to use an array since I may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements, (200), that will be very fast.
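A hedged sketch of that approach (hypothetical names):

#include <stddef.h>
#include <stdint.h>

/* A small lookup table, searched linearly: fine for ~100-200 entries. */
struct entry {
    uint32_t key;   /* the large number, anywhere in 1 .. 2^32 - 1 */
    void    *data;  /* pointer to the associated data structure */
};

void *lookup(const struct entry *table, size_t n, uint32_t key)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].key == key)
            return table[i].data;
    }
    return NULL;   /* not found */
}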
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.
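For example, a sketch using uthash's documented HASH_ADD_INT/HASH_FIND_INT macros, assuming the keys fit in an int-sized field (the surrounding names are illustrative):

#include <stdlib.h>
#include "uthash.h"

struct item {
    int key;             /* the lookup number; HASH_ADD_INT expects an int-sized field */
    void *data;          /* the associated data structure */
    UT_hash_handle hh;   /* makes this struct hashable by uthash */
};

static struct item *table = NULL;   /* head of the hash table, must start NULL */

void put(int key, void *data)
{
    struct item *it = malloc(sizeof *it);
    it->key = key;
    it->data = data;
    HASH_ADD_INT(table, key, it);   /* "key" names the key field */
}

void *get(int key)
{
    struct item *it = NULL;
    HASH_FIND_INT(table, &key, it);
    return it ? it->data : NULL;
}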
Say I want to have some kind of a bitmap to know the number of times a particular char appears in a string.
So, for example, if I read the string "abracadabra", I would have a data structure that would look something like this:
a -> 5
b -> 2
r -> 2
c -> 1
d -> 1
I have read a book (Programming Interviews Exposed) that says the following:
Hashtables have a higher lookup overhead than arrays.
An array would need an element for every possible character.
A hashtable would need to store just the characters that actually appear in the string.
Therefore:
Arrays are a better choice for long strings with a limited set of possible characters and hash tables are more efficient for shorter strings or when there are many possible character values.
I don't understand why hashtables have a higher lookup overhead than arrays. Why is that?
An array is an extremely simple data structure. In memory, it is a simple contiguous block. Say each item in the array is four bytes, and the array has room for 100 elements. Then the array is simply 400 contiguous bytes in memory, and your variable assigned to the array is a pointer to the first element. Say this is at location 10000 in memory.
When you access element #3 of the array, like this:
myarray[3] = 17;
...what happens is very simple: 3 multiplied by the element size (4 bytes) is added to the base pointer. In this example it's 10000 + 3 * 4 = 10012. Then you simply write to the 4 bytes located at address 10012. Trivially simple math.
A hashtable is not an elementary data structure. It could be implemented in various ways, but a simple one might be an array of 256 lists. Then when you access the hashtable, first you have to calculate the hash of your key, then look up the right list in the array, and finally walk along the list to find the right element. This is a much more complicated process.
A simple array is always going to be faster than a hashtable. What the text you cite is getting at is that if the data is very sparse... you might need a very large array to do this simple calculation. In that case you could use a lot less memory space to hold the hash table.
Consider if your characters were Unicode -- two bytes each. That's 65536 possible characters. And say you're only talking about strings with 256 or fewer characters. To count those characters with an array, you would need to make an array with 64K elements, one byte each... taking 64K of memory. The hashtable on the other hand, implemented as I mentioned above, might take only 4 * 256 = 1024 bytes for the array of list pointers, and then 5-8 bytes per list element. So if you were processing a 256-character string with, say, 64 unique Unicode characters used, it would take up a total of at most about 1.5K. Under these conditions, the hashtable would be using much less memory. But it's always going to be slower.
Finally, in the simple case you show, you're probably just talking about the Latin alphabet, so if you force lowercase, you could have an array with just 26 elements, and make the counters as large as you want so you can count as many characters as you'll need. Even if it's 4 billion, you would need just a 26 * 4 = 104 byte array. So that's definitely the way to go here.
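A minimal C sketch of that counting array for the "abracadabra" example:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "abracadabra";
    unsigned long counts[26] = {0};   /* one counter per lowercase Latin letter */

    for (size_t i = 0; i < strlen(s); i++) {
        int c = tolower((unsigned char)s[i]);
        if (c >= 'a' && c <= 'z')
            counts[c - 'a']++;        /* direct base + offset access, no hashing */
    }

    for (int i = 0; i < 26; i++)
        if (counts[i])
            printf("%c -> %lu\n", 'a' + i, counts[i]);

    return 0;
}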
Hashtables have a higher lookup overhead than arrays? Why is that?
When using an array for character counting, the access is direct:
counter[c]++;
A hashtable, in contrast, is a (complex) data structure: first a hash function must be calculated, then a second function reduces the hash code to a position in the table.
If that table position is already in use, additional work has to be done to handle the collision.
I personally think that as long as your characters are in the single-byte range (0-255), the array approach is always faster and better suited. Once you get to Unicode characters (which are the default in Java Strings), the hashtable is more appropriate.
Hashtables have a higher lookup overhead than arrays? Why is that?
Because they have to calculate the hash from the key and then search for the key.
In contrast, arrays have O(1) lookup time. For accessing a value in an array, typically calculating the offset and returning the element at that offset is enough, this works in constant time.
Is there any tricky way to implement a set data structure (a collection of unique values) in C? All elements in a set will be of the same type, and there is plenty of RAM available.
As far as I know, for integers it can be done really fast and easily using value-indexed arrays. But I'd like to have a very general Set data type. And it would be nice if a set could include itself.
There are multiple ways of implementing set (and map) functionality, for example:
tree-based approach (ordered traversal)
hash-based approach (unordered traversal)
Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.
Beware of the advantages and disadvantages of hash-based vs. tree-based approaches.
You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:
all hashables in a bucket have the same hash value
a bucket can be implemented as a dynamic array or linked list of hashables
a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)
With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.
You would have to implement:
the hash function for the type being hashed
an equality function for the type being used to test whether two hashables are equal or not
the hash-set contains/insert/remove functionality.
You can also use open addressing as an alternative to maintaining and managing buckets.
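A condensed C sketch of such a chained hash-set over void * keys, with caller-supplied hash and equality functions (all names illustrative):

#include <stdlib.h>

#define NBUCKETS 1024   /* sized generously, per the note on memory above */

typedef size_t (*hash_fn)(const void *key);
typedef int (*eq_fn)(const void *a, const void *b);

struct node {
    const void *key;
    struct node *next;
};

struct hashset {
    struct node *buckets[NBUCKETS];   /* each bucket is a linked list (chaining) */
    hash_fn hash;
    eq_fn equal;
};

int hashset_contains(const struct hashset *s, const void *key)
{
    size_t b = s->hash(key) % NBUCKETS;
    for (const struct node *n = s->buckets[b]; n != NULL; n = n->next)
        if (s->equal(n->key, key))
            return 1;
    return 0;
}

/* Returns 1 if the key was inserted, 0 if it was already present.
   (Allocation checks omitted for brevity.) */
int hashset_insert(struct hashset *s, const void *key)
{
    if (hashset_contains(s, key))
        return 0;
    size_t b = s->hash(key) % NBUCKETS;
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = s->buckets[b];
    s->buckets[b] = n;
    return 1;
}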
Sets are usually implemented as some variety of a binary tree. Red black trees have good worst case performance.
These can also be used to build a map to allow key/value lookups.
This approach requires some sort of ordering on the elements of the set and the key values in a map.
I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.
The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.
The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get into large enough sizes to compete with pointers this has a very small probability of failing.
Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.
(There are similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)
Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.
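A sketch of what such a tagged value might look like in C (hypothetical names):

#include <string.h>

enum value_tag { TAG_OBJECT, TAG_INT, TAG_STRING };

struct value {
    enum value_tag tag;
    union {
        void       *object;   /* compared by identity (pointer equality) */
        long        integer;  /* compared by value */
        const char *string;   /* compared by contents */
    } as;
};

/* Equality respects the tag, so the integer 5 can never collide with a
   pointer or a string that happens to have the same bit pattern. */
int value_equal(const struct value *a, const struct value *b)
{
    if (a->tag != b->tag)
        return 0;
    switch (a->tag) {
    case TAG_OBJECT: return a->as.object == b->as.object;
    case TAG_INT:    return a->as.integer == b->as.integer;
    case TAG_STRING: return strcmp(a->as.string, b->as.string) == 0;
    }
    return 0;
}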
If the maximum number of elements in the set (the cardinality of the underlying data type) is small enough, you might want to consider using a plain old array of bits (or whatever you call them in your favourite language).
Then you have a simple set membership check: bit n is 1 if element n is in the set. You could even count 'ordinary' members from 1, and only make bit 0 equal to 1 if the set contains itself.
This approach will probably require some other data structure (or function) to translate from the member data type to the position in the bit array (and back), but it makes basic set operations (union, intersection, membership test, difference, insertion, removal, complement) very easy. It is only suitable for relatively small sets, though; you wouldn't want to use it for sets of 32-bit integers, I don't suppose.
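A minimal C sketch of such a bit-array set, assuming a fixed universe of MAX_ELEMENTS possible members (names are illustrative):

#include <limits.h>
#include <stddef.h>

#define MAX_ELEMENTS 1024                           /* assumed bound on the universe */
#define WORD_BITS (CHAR_BIT * sizeof(unsigned int))

typedef struct {
    unsigned int bits[(MAX_ELEMENTS + WORD_BITS - 1) / WORD_BITS];
} bitset;

/* Element n is in the set iff bit n is 1. */
void bitset_add(bitset *s, size_t n)      { s->bits[n / WORD_BITS] |= 1u << (n % WORD_BITS); }
void bitset_remove(bitset *s, size_t n)   { s->bits[n / WORD_BITS] &= ~(1u << (n % WORD_BITS)); }
int  bitset_contains(const bitset *s, size_t n)
{
    return (s->bits[n / WORD_BITS] >> (n % WORD_BITS)) & 1u;
}

/* Union and intersection are just word-wise OR and AND. */
void bitset_union(bitset *out, const bitset *a, const bitset *b)
{
    for (size_t i = 0; i < sizeof a->bits / sizeof a->bits[0]; i++)
        out->bits[i] = a->bits[i] | b->bits[i];
}

With this layout, membership tests, insertions, and removals are single bit operations, which is the point the answer above is making.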