I am looking to compare two values (e.g. whether one is greater than or less than the other) in a HashMap, Hashtable, Map, or any other array-like type.
Could you please help me with this?
Here are some factors that would affect your selection of a data structure:
What is the purpose of the comparison?
What type of data are you comparing?
How often will data be inserted into this data structure?
How often will data be selected from this data structure?
When should you use a HashMap?
Use a HashMap when your main requirement is retrieving or modifying data based on a key. For example, in web applications the username is stored as the key and the user data as the value in the HashMap, for fast retrieval of the user data corresponding to a username.
When should you not use a HashTable?
Prefer something like a balanced tree (e.g. std::map) over a hash table when:
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory is used), the majority of implementations will outgrow the arrays they're using for the hash table every now and then, then allocate a bigger array and copy content across; this can make the specific insertions that cause this rehashing much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balanced binary tree may serve
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket-sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
I have a collection of objects (max 500).
My entries will be looked up frequently based on a MAC-like key, whose range is unknown.
Now, I am confused as to which data structure and algorithm to use for effective look up of values.
I am not sure whether to go for a balanced BST (AVL) or a hashtable for this case.
Are 500 keys small for building hashtables?
What would be the best approach in my case?
I read that computing the hash might prove costly when the number of keys is small.
On a side note, I would also like to know what minimum number of entries should be present before a hashtable is worth considering.
Please add a comment if further details are needed.
Thanks in advance.
Below are some of the benefits of hash structures:
Fast lookup (O(1) theoretically)
Efficient storage (stores key-value pairs)
Though these properties are beneficial, in some scenarios a hash table can underperform:
If you have a large number of objects, more storage space (memory) will be required, which can cause a performance hit.
The hashing/key algorithm should not be complex; otherwise more time will be spent on hashing and finding the key.
Key collisions should be minimal, to avoid a linear search through all the values for a single key, or key duplication.
In your case, if the hashing algorithm is not too complex, you can definitely use a hashtable, as you have only 500 objects. If you have a lookup-intensive workflow, a hashtable can save a lot of time. If your data is nearly static, don't worry about the initial loading time, because your lookup time will be much faster.
You can also look at other data structures that are efficient for small collections, such as hash sets, AVL trees, or hash trees. For 500 objects the difference between a linear search and a hash search will be a matter of milliseconds or microseconds, so you won't achieve much performance improvement. So favour simplicity and readability.
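To make this concrete, here is a minimal sketch in C of a direct-chaining table sized for roughly 500 entries. Storing the MAC-like key in a uint64_t, the fixed bucket count of 1024, and the MurmurHash3-style mixing constant are all illustrative assumptions, not requirements from the question.

    #include <stdint.h>
    #include <stdlib.h>

    #define NBUCKETS 1024                   /* fixed size; plenty for ~500 entries */

    struct entry {
        uint64_t key;                       /* 48-bit MAC held in the low bits */
        void *value;                        /* pointer to the stored object */
        struct entry *next;                 /* chain on collision */
    };

    static struct entry *table[NBUCKETS];

    /* Cheap bit mixing so that MACs sharing a vendor prefix do not all
     * land in the same bucket. */
    static size_t bucket_of(uint64_t key)
    {
        key ^= key >> 33;
        key *= 0xff51afd7ed558ccdULL;       /* 64-bit mixing constant */
        key ^= key >> 33;
        return (size_t)(key & (NBUCKETS - 1));
    }

    void ht_put(uint64_t key, void *value)
    {
        size_t b = bucket_of(key);
        struct entry *e = malloc(sizeof *e);
        e->key = key;
        e->value = value;
        e->next = table[b];                 /* push onto the bucket's chain */
        table[b] = e;
    }

    void *ht_get(uint64_t key)
    {
        for (struct entry *e = table[bucket_of(key)]; e != NULL; e = e->next)
            if (e->key == key)
                return e->value;
        return NULL;
    }

With 500 entries spread over 1024 buckets the chains stay very short, so lookups are effectively constant time, and the bucket array itself costs only a few kilobytes of pointers on top of the entries.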
I know that I can simply use a bucket array for an associative container if I have uniformly distributed integer keys or keys that can be mapped into uniformly distributed integers. If I can create the array big enough to ensure a certain load factor (which assumes the collection is not too dynamic), then the expected number of collisions for a key will be bounded, because this is simply a hash table with the identity hash function.
Edit: I view strings as equivalent to positional fractions in the range [0..1]. So they can be mapped into any integer range by multiplication and taking floor of the result.
I can also do prefix queries efficiently, just like with tries. I presume (without knowing a proof) that the expected number of empty slots corresponding to a given prefix that have to be skipped sequentially before the first bucket with at least one element is reached is also going to be bounded by constant (again depending on the chosen load factor).
And of course, I can do stabbing queries in worst-case constant time, and range queries in solely output sensitive linear expected time (if the conjecture of denseness from the previous paragraph is indeed true).
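A sketch of that mapping in C, for concreteness: it treats the first 8 bytes of the string as a 64-bit fixed-point fraction and takes the top bits as the bucket index. The 8-byte prefix and the power-of-two bucket count are illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* View the string as a positional fraction in [0, 1): the first byte
     * contributes b0/256, the second b1/256^2, and so on.  The first
     * 8 bytes give a 64-bit fixed-point approximation of that fraction. */
    static uint64_t string_fraction(const char *s)
    {
        uint64_t f = 0;
        size_t len = strlen(s);
        for (size_t i = 0; i < 8; i++) {
            f <<= 8;
            if (i < len)
                f |= (uint8_t)s[i];
        }
        return f;                           /* fraction scaled by 2^64 */
    }

    /* Multiply by the bucket count and take the floor.  With a power-of-two
     * bucket count (1 <= log2_nbuckets <= 63) this reduces to keeping the
     * top bits.  The mapping is monotone, so lexicographic order of strings
     * becomes numeric order of bucket indices, which is what makes the
     * prefix and range queries described above possible. */
    static size_t bucket_index(const char *s, unsigned log2_nbuckets)
    {
        return (size_t)(string_fraction(s) >> (64 - log2_nbuckets));
    }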
What are the advantages of tries then?
If the distribution is uniform, I don't see anything that tries do better. But I may be wrong.
If the distribution has a large uncompensated skew (because we had no prior probabilities, or are just looking at the worst case), the bucket array performs poorly, but tries also become heavily imbalanced and can have linear worst-case performance with strings of arbitrary length. So the use of either structure for your data is questionable.
So my question is - what are the performance advantages of tries over bucket arrays that can be formally demonstrated? What kind of distributions elicit those advantages?
I was thinking of distributions with self-similar structure at different scales. I believe those are called fractal distributions, of which I confess to know nothing. Maybe then, if the distribution is prone to clustering at every scale, tries can provide superior performance by keeping the load factor of each node similar, adding levels in dense regions as necessary - something that bucket arrays cannot do.
Thanks
Tries are good if your strings share common prefixes. In that case, the prefix is stored only once and can be queried with linear performance in the output string length. In a bucket array, all strings with the same prefixes would end up close together in your key space, so you have very skewed load where most buckets are empty and some are huge.
More generally, tries are also good if particular patterns (e.g. the letters t and h together) occur often. If there are many such patterns, the order of the trie's tree nodes will typically be small, and little storage is wasted.
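To make the prefix sharing concrete, here is a deliberately naive trie over lowercase ASCII in C; the 26-way branching and the absence of any node compression are simplifying assumptions.

    #include <stdbool.h>
    #include <stdlib.h>

    #define ALPHABET 26                     /* keys restricted to a-z for brevity */

    struct trie_node {
        struct trie_node *child[ALPHABET];
        bool is_end;                        /* true if a key ends at this node */
    };

    /* "the", "them" and "theory" share the t -> h -> e nodes exactly once. */
    void trie_insert(struct trie_node *root, const char *key)
    {
        struct trie_node *n = root;
        for (; *key; key++) {
            int c = *key - 'a';
            if (n->child[c] == NULL)
                n->child[c] = calloc(1, sizeof(struct trie_node));
            n = n->child[c];
        }
        n->is_end = true;
    }

    bool trie_contains(const struct trie_node *root, const char *key)
    {
        const struct trie_node *n = root;
        for (; n != NULL && *key; key++)
            n = n->child[*key - 'a'];
        return n != NULL && n->is_end;
    }

A prefix query simply walks down to the node for the prefix and enumerates the subtree below it, so its cost depends on the prefix length and the output size, not on the total number of keys stored.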
One of the advantages of tries I can think of is insertion. A bucket array may need to be resized at some point, and this is an expensive operation. So the worst-case insertion time into a trie is much better than into a bucket array.
Another thing is that you need to map the string to a fraction to use it with a bucket array. So if you have short keys, a trie can theoretically be more efficient, because you don't need to do the mapping.
By random access I do not mean selecting a random record; random access is the ability to fetch any record in equal time, the same way values are fetched from an array.
From wikipedia: http://en.wikipedia.org/wiki/Random_access
My intention is to store a very large array of strings, one that is too big for memory, but still have the benefit of random access to the array.
I usually use MySQL, but it seems it has only B-Tree and Hash index types.
I don't see a reason why it isn't possible to implement such a thing.
The indexes will be like in an array, starting from zero and incrementing by 1.
I want to simply fetch a string by its index, not get the index according to the string.
The goal is to improve performance. I also cannot control the order in which the strings will be accessed; it'll be a remote DB server which will constantly receive indexes from clients and return the string for each index.
Is there a solution for this?
P.S. I don't think this is a duplicate of "Random-access container that does not fit in memory?", because in that question there are other demands besides random access.
Given your definition, if you just use an SSD for storing your data, it will allow for what you call random access (i.e. uniform access speed across the data set). The fact that sequential access is less expensive than random access comes from the fact that sequential access to disk is much faster than random access (and any database tries its best to make up for this, by the way).
That said, even RAM access is not uniform, as sequential access is faster due to caching and NUMA. So uniform access is an illusion anyway, which raises the question of why you insist on having it in the first place. In other words, what do you think will go wrong with slow random access? It might still be fast enough for your use case.
You are talking about constant time, but you mention a unique incrementing primary key.
Unless such a key is gapless, you cannot use it as an offset, so you still need some kind of structure to look up the actual offset.
Finding a record by offset isn't usually particularly useful, since you will usually want to find it by some more friendly method, which will invariably involve an index. Searching a B-Tree index is worst case O(log n), which is pretty good.
Assuming you just have an array of strings, store it in a disk file of fixed-length records and use the file system to seek to your desired offset.
Then benchmark against a database lookup.
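A sketch of that suggestion in C; the record length of 256 bytes and the file name are assumptions made for illustration only.

    #include <stdio.h>

    #define RECLEN 256                      /* every string padded to this length */

    /* Fetch the string at a given index from a file of fixed-length records.
     * Returns 0 on success, -1 if the index lies past the end of the file.
     * For files over 2 GB you would use fseeko()/off_t instead of fseek(). */
    int fetch_record(FILE *f, long index, char out[RECLEN + 1])
    {
        if (fseek(f, index * (long)RECLEN, SEEK_SET) != 0)
            return -1;
        if (fread(out, 1, RECLEN, f) != RECLEN)
            return -1;
        out[RECLEN] = '\0';                 /* records are padded, not terminated */
        return 0;
    }

    /* Usage sketch:
     *   FILE *f = fopen("strings.dat", "rb");
     *   char buf[RECLEN + 1];
     *   if (f && fetch_record(f, 42, buf) == 0)
     *       puts(buf);
     */

The seek itself is O(1); whether the read behind it takes "equal time" for every index then depends on the storage device and the OS cache, as discussed in the other answer.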
I have a number of data sets that follow a key-value pattern, i.e. a string key and a pointer to the data. Right now they are stored in hashtables, each table having an array of slots corresponding to hash keys, and on collision forming a linked list under each slot that has collisions (direct chaining). All implemented in C (and it should stay in C), if it matters.
Now, the data is actually 3 slightly different types of data sets:
Some sets can be changed (keys added, removed, replaced, etc.) at will
For some sets data can be added but almost never replaced/removed (i.e. it can happen, but in practice it is very rare)
For some sets the data is added once and then only looked up, it is never changed once the whole set is loaded.
All sets of course have to support lookups as fast as possible, and consume minimal amounts of memory (though lookup speed is more important than size).
So the question is: is there some better hashtable structure/implementation that would suit these specific cases better? I suspect that for the first case chaining is the best, but I am not sure about the other two cases.
If you are using linked lists for each bucket in your hashtable, you have already accepted relatively poor performance on modern CPUs (linked lists have poor locality and therefore poor CPU cache interaction). So I probably wouldn't worry about optimizing the other special cases. However, here are a few tips if you want to continue down the path you are using:
For the 'frequent changes' data set and the 'almost never changes' cases, every time you read an item from the hash table, move it to the front of the linked list chain for that bucket (a sketch of this move-to-front step follows after these tips). For some even better ideas, this paper, even though it focuses on fixed-size keys, is a good starting point: Fast and Compact Hash Tables for Integer Keys.
For the 'data set never changes' case you should look into the perfect hash generators. If you know your keys at compile time I've had good results with gperf. If your keys are not available until run-time try the C Minimal Perfect Hashing Library.
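Here is a sketch of the move-to-front lookup mentioned in the first tip, for a direct-chaining table like the one described in the question; the struct and function names are placeholders, not the asker's actual code.

    #include <string.h>

    struct node {
        const char *key;
        void *data;
        struct node *next;
    };

    /* Look up `key` in one bucket's chain; on a hit, splice the node out and
     * re-link it at the head so frequently used keys are found first. */
    void *chain_lookup_mtf(struct node **bucket, const char *key)
    {
        struct node *prev = NULL;
        for (struct node *n = *bucket; n != NULL; prev = n, n = n->next) {
            if (strcmp(n->key, key) == 0) {
                if (prev != NULL) {         /* not already at the front */
                    prev->next = n->next;
                    n->next = *bucket;
                    *bucket = n;
                }
                return n->data;
            }
        }
        return NULL;
    }

The caller would pass &table[hash(key) % nbuckets] as the bucket; for the never-changing data set the perfect-hash route above avoids long chains altogether.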
Those sets that are small (tens of elements) might be fastest using a binary or even a linear search over the keys stored in sequential memory!
Obviously the key bodies, or hashes of them, have to be in sequential memory. But if you can get that into one or two L1 cache lines, it'll fly (a sketch follows below).
As for the bigger hashes, the direct chaining might lose out to open addressing?
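For the small-set suggestion, a sketch of what "keys in sequential memory" might look like in C; the 32-bit keys and the capacity of 16 (so the key array alone fills one 64-byte cache line) are assumptions.

    #include <stddef.h>
    #include <stdint.h>

    /* Keys packed contiguously so the whole key array fits in one cache
     * line; values live in a parallel array. */
    struct small_set {
        uint32_t keys[16];
        void    *vals[16];
        size_t   n;                         /* number of slots in use */
    };

    void *small_set_find(const struct small_set *s, uint32_t key)
    {
        for (size_t i = 0; i < s->n; i++)   /* plain linear scan */
            if (s->keys[i] == key)
                return s->vals[i];
        return NULL;
    }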
You could explore "cache conscious" hash tables and tries.
The Wikipedia article discusses cache lines in detail, describing the various trade-offs to consider.
What is the best data structure to store millions/billions of records (assume a record contains a name and an integer) in memory (RAM)?
Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority)? Is it a Patricia tree? Is there anything better than that?
The search key is an integer (say a 32-bit random integer), and all records are in RAM (assuming enough RAM is available).
In C, platform Linux.
Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search for/delete the record in an efficient manner. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes or are they uniformly distributed? Only if they have common prefixes, a patricia tree is useful.
Do you look up indexes in order (gang lookup), or randomly? If everything is uniform, random and no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer where gang lookup is used, you might look into radix trees.
my educated guess is a B-Tree (but I could be wrong ...):
B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
Instead of a hash you can at least use a radix tree to get started.
For any specific problem, you can do much better than a B-tree, a hash table, or a Patricia trie. Describe the problem a bit better, and we can suggest what might work.
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If using a hash table, you want to pre-allocate the hashtable for the expected final size so it doesn't need to rehash.
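A sketch of the pre-allocation idea for a hand-rolled table in C; the 0.75 load factor and the power-of-two rounding are conventional choices, not requirements.

    #include <stddef.h>
    #include <stdlib.h>

    /* Pick the bucket count once, from the expected number of records, so
     * the table never has to grow and rehash.  Rounds expected/0.75 up to
     * the next power of two so the bucket index can be computed by masking. */
    static size_t choose_bucket_count(size_t expected_records)
    {
        size_t want = expected_records + expected_records / 3;  /* ~ expected/0.75 */
        size_t n = 1;
        while (n < want)
            n <<= 1;
        return n;
    }

    /* e.g. one billion records -> 2^31 buckets, allocated exactly once:
     *   struct record **table =
     *       calloc(choose_bucket_count(1000000000), sizeof *table); */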
We can use a trie where each node branches on a single 1/0 bit to store the integer values. With this we can ensure that the depth of the tree is 32/64, so the fetch time is constant, with sub-linear space complexity.
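A minimal sketch of that binary trie in C for 32-bit keys, branching one bit per level as described; a production version would usually compress paths or consume several bits per level to cut the pointer overhead.

    #include <stdint.h>
    #include <stdlib.h>

    struct bit_node {
        struct bit_node *child[2];          /* one child per bit value */
        void *record;                       /* user record, set at the leaf */
    };

    /* Walk from the most significant bit to the least; the depth is always 32. */
    void bit_trie_insert(struct bit_node *root, uint32_t key, void *record)
    {
        struct bit_node *n = root;
        for (int i = 31; i >= 0; i--) {
            int b = (key >> i) & 1;
            if (n->child[b] == NULL)
                n->child[b] = calloc(1, sizeof(struct bit_node));
            n = n->child[b];
        }
        n->record = record;
    }

    void *bit_trie_lookup(const struct bit_node *root, uint32_t key)
    {
        const struct bit_node *n = root;
        for (int i = 31; i >= 0 && n != NULL; i--)
            n = n->child[(key >> i) & 1];
        return n != NULL ? n->record : NULL;
    }

The number of node visits is fixed at 32 regardless of how many keys are stored, but each key can create up to 32 two-pointer nodes, so the space cost per key is worth measuring against a pre-sized hash table.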