I am building a map tile store and need to store 1.5 billion ~3 KB blobs. Over 95% of them are duplicates. Is there a NoSQL storage engine that would avoid storing identical values?
I could of course implement double de-referencing, e.g. key->hash->value. If the hash is MD5, the 16-byte hashes alone would use up 24 GB, plus the per-item overhead, which is probably much more. Is there anything more efficient?
Thanks!
Double de-referencing is the way to go - you'd be saving somewhere between 4 and 5 TB by not storing duplicate data, so a 24 GB set of hashes is well worth the cost. Additionally, you only need to compute the hash function on inserts and updates, not on lookups or deletions.
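A rough sketch of that two-level layout in Java (the in-memory HashMaps here just stand in for whatever on-disk key-value store you choose; class and method names are invented for illustration):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the key -> hash -> blob scheme. The two HashMaps stand in for
    // the on-disk key-value store you actually use.
    public class DedupTileStore {
        private final Map<String, String> keyToHash = new HashMap<>();
        private final Map<String, byte[]> hashToBlob = new HashMap<>();

        public void put(String tileKey, byte[] blob) throws NoSuchAlgorithmException {
            String hash = hexDigest(blob);
            // Store the blob only once per distinct hash value.
            hashToBlob.putIfAbsent(hash, blob);
            keyToHash.put(tileKey, hash);
        }

        public byte[] get(String tileKey) {
            String hash = keyToHash.get(tileKey);               // first lookup
            return hash == null ? null : hashToBlob.get(hash);  // second lookup
        }

        private static String hexDigest(byte[] blob) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5").digest(blob);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }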
To reduce the cost of double de-referencing on lookups, you can supplement your on-disk key-value database with an in-memory cache such as Redis. You can either cache frequently accessed key->hash pairs, avoiding two lookups on the primary database, or store the entire key->hash->blob structure directly in the cache. The former is much simpler to implement because you don't need to replicate the double de-referencing from the primary database, whereas the latter makes more sense if only a small subset of the blobs is ever active.
You may be able to use a simpler/smaller hash. The probability of at least one collision is approximately 1 - e^(-k^2 / 2N), where k is the number of distinct values being hashed and N is the number of possible hash values (2^64 for a 64-bit hash). With k = 1.5 billion, a good 64-bit hash has roughly a 6% worst-case chance of a collision (far less if only the ~5% distinct blobs are counted), while a good 128-bit hash has an infinitesimal chance. MurmurHash has 64- and 128-bit versions, so you can experiment with both, and it's faster than MD5 largely because MD5 is a cryptographic hash function, whereas Murmur doesn't carry the added expense/complexity of being cryptographically secure (I'm assuming you're not worried about anybody intentionally trying to generate hash collisions).

Some key-value stores also make it relatively easy to make your design collision-tolerant. For example, you could store the hash in a Riak Map with a flag indicating whether there has been a collision on that hash value: if false, simply return the blob; otherwise fall back to a slower path, e.g. the indexed blob becomes the colliding blobs zipped/tarred together along with a CSV of which keys correspond to which blob. Even with a 64-bit hash this code path will rarely be exercised, so implementation simplicity likely trumps performance there. The question is whether the reduced memory/hashing overhead makes up for the complexity of collision tolerance.
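To put concrete numbers on that formula in Java (taking k = 1.5 billion as the worst case where every blob is distinct; Math.expm1 is used only so the tiny 128-bit figure doesn't round to zero):

    // Birthday-bound approximation: p ~ 1 - e^(-k^2 / 2N), where k is the number
    // of distinct values hashed and N is the number of possible hash values.
    public class CollisionOdds {
        public static void main(String[] args) {
            double k = 1.5e9;                       // worst case: every blob distinct
            double n64 = Math.pow(2, 64);
            double n128 = Math.pow(2, 128);
            double p64 = -Math.expm1(-(k * k) / (2 * n64));    // ~0.059, i.e. roughly 6%
            double p128 = -Math.expm1(-(k * k) / (2 * n128));  // ~3.3e-21
            System.out.printf("64-bit:  %.4f%n", p64);
            System.out.printf("128-bit: %.3e%n", p128);
        }
    }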
Does a B+-tree have any advantages over a hash index for an in-memory database if no range queries are needed, assuming both are well designed and implemented?
It's not only range searches: a B-tree (or a variant like the B+-tree) can also accommodate partial-key searches (i.e. where you know the prefix but not the whole key value) and sorted traversal, and it easily allows duplicates (hash indexes customarily enforce uniqueness).
A hash index can potentially use less memory (B-trees always contain some empty space).
Hash tables can be allocated statically (the size doesn't change) or dynamically. Static allocation is best if you know with reasonable accuracy how many key values will be stored. Dynamically allocated hash tables will have wasted space unless/until the new buckets fill up. B-trees naturally grow as needed.
If a hash table is too small or the hash algorithm is poor, there will be collisions that require chaining, which increases lookup and insertion time. Choosing the best hash algorithm depends to some extent on the type of data being indexed, so it's virtually impossible to have a single generic hash algorithm that is optimal for everything. B-trees don't have this issue.
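To illustrate the difference in Java (TreeMap playing the role of the ordered, B-tree-like structure and HashMap the hash index; the data is made up):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class OrderedVsHashed {
        public static void main(String[] args) {
            TreeMap<String, Integer> ordered = new TreeMap<>();
            Map<String, Integer> hashed = new HashMap<>();
            for (String k : new String[] {"apple", "apricot", "banana", "cherry"}) {
                ordered.put(k, k.length());
                hashed.put(k, k.length());
            }

            // Partial-key (prefix) search: every key starting with "ap".
            // "ap\uffff" is a crude upper bound for the prefix range.
            System.out.println(ordered.subMap("ap", true, "ap\uffff", true)); // {apple=5, apricot=7}

            // Sorted traversal comes for free with the ordered structure.
            System.out.println(ordered.keySet()); // [apple, apricot, banana, cherry]

            // The HashMap can only answer exact-key lookups; prefix or sorted
            // access would require scanning every entry.
            System.out.println(hashed.get("apple")); // 5
        }
    }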
I am looking to compare two values (e.g. whether one is greater than or less than the other) in a HashMap, Hashtable, Map, or any other array-like type.
Could you please help me with this?
Here are some factors that would affect your selection of a data structure:
What is the purpose of the comparison?
What type of data are you comparing?
How often will data be inserted into this data structure?
How often will data be selected from this data structure?
When should you use a HashMap?
One should use a HashMap when the major requirement is only retrieving or modifying data based on a key. For example, in web applications the username is stored as a key and the user data is stored as a value in the HashMap, for faster retrieval of the user data corresponding to a username.
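A minimal Java sketch of that pattern (UserData and the sample users are invented for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class UserCache {
        // Hypothetical record holding per-user data.
        record UserData(String displayName, String email) {}

        public static void main(String[] args) {
            Map<String, UserData> usersByName = new HashMap<>();
            usersByName.put("alice", new UserData("Alice", "alice@example.com"));
            usersByName.put("bob", new UserData("Bob", "bob@example.com"));

            // O(1) expected lookup by key - the case a HashMap is built for.
            System.out.println(usersByName.get("alice").email()); // alice@example.com
        }
    }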
When should you not use a HashTable?
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision-prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful, and slow when excessive memory is used), most implementations will outgrow the array they're using for the hash table every now and then, then allocate a bigger array and copy the content across; this can make the specific insertions that trigger the rehashing much slower than the normal O(1) behaviour, even though the average is still O(1) - if you need more consistent behaviour in all cases, something like a balanced binary tree may serve you better (one common mitigation is shown in the sketch after this list)
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket-sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
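As one concrete mitigation for the resizing point in the list above, most hash table implementations let you presize the table; a Java sketch, with the expected entry count purely assumed:

    import java.util.HashMap;
    import java.util.Map;

    public class PresizedMap {
        public static void main(String[] args) {
            int expectedEntries = 1_000_000;          // assumed workload size
            float loadFactor = 0.75f;                 // HashMap's default
            // Capacity chosen so the table never has to grow (and rehash)
            // while inserting the expected number of entries.
            int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
            Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);

            for (int i = 0; i < expectedEntries; i++) {
                map.put(i, "value-" + i);             // no resize-induced latency spikes
            }
            System.out.println(map.size());
        }
    }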
I have a collection of objects (max 500).
My entries will be looked up frequently based on a MAC-like key, whose range is unknown.
Now, I am confused as to which data structure and algorithm to use for effective lookup of values.
I am not sure whether to go for a balanced BST (AVL) or a hashtable in this case.
Are 500 keys too few to make a hashtable worthwhile?
What would be the best approach in my case?
I have read that computing the hash might prove costly when the number of keys is small.
On a side note, I would also like to know roughly how many entries (at minimum) should be present before a hashtable is worth considering.
Please add a comment if further details are needed.
Thanks in advance.
Below are some of the benefits of hash structures:
Fast lookup (O(1) in theory)
Efficient key-value storage
Though these properties are beneficial, in some scenarios a hash table can underperform:
If you have a large number of objects, more storage space (memory) will be required, which can cause a performance hit.
The hashing/key algorithm should not be complex; otherwise more time will be spent on hashing and finding keys.
Key collisions should be minimal, to avoid a linear search through all the values for a single key, or key duplication.
In your case, if the hashing algorithm is not too complex, you can definitely use a hashtable, since you only have 500 objects. If you have a lookup-intensive workload, a hashtable can save a lot of time. If your data is nearly static, don't worry about the initial loading time, because your lookup time will be much faster.
You can also look at other data structures that are efficient for small collections, such as hash sets, AVL trees, or hash trees. For 500 objects the difference between a linear search and a hash lookup will be on the order of milliseconds or microseconds, so you won't gain much performance either way. So favour simplicity and readability.
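A minimal Java sketch for that scenario (the Device payload and the sample MAC keys are invented):

    import java.util.HashMap;
    import java.util.Map;

    public class MacLookup {
        record Device(String hostname, int port) {}   // hypothetical per-entry payload

        public static void main(String[] args) {
            Map<String, Device> byMac = new HashMap<>(1024); // comfortably holds ~500 entries
            byMac.put("00:1A:2B:3C:4D:5E", new Device("switch-01", 7));
            byMac.put("00:1A:2B:3C:4D:5F", new Device("switch-02", 3));

            // Hashing a short MAC-like string is cheap; lookup stays O(1) expected
            // regardless of the (unknown) key range.
            Device d = byMac.get("00:1A:2B:3C:4D:5E");
            System.out.println(d.hostname()); // switch-01
        }
    }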
I have 2^32 4-byte keys that I'm hashing; what's the chance of a collision?
What if I have 2^64 8-byte keys (I'm not really storing every key, but I want to know the worst case)?
Per the Wikipedia page on the Birthday Problem, a good first-order approximation for the collision probability is 1 - e^(-n^2 / 2d), where n is the number of keys being hashed and d is the number of possible hash values. Graphing this for your values (with a logarithmic horizontal axis, zoomed in on where the probability starts to spike) shows roughly where collisions become likely. Note that this is only an approximation and should be treated conservatively (i.e. the real probability may be somewhat higher, but it should be in the right ballpark).
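For a concrete worked example (assuming the keys are hashed into a 64-bit space, so d = 2^64): with n = 2^32 keys, n^2 / 2d = 2^64 / 2^65 = 0.5, giving p ~ 1 - e^(-0.5), i.e. roughly a 39% chance of at least one collision. With n = 2^64 keys hashed into that same 64-bit space, the exponent is astronomically large and a collision is essentially certain, so the worst case would need a wider hash.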
What are you doing with the hash codes? If you're using them to work out whether two pieces of data are the same, an MD5 hash is pretty good, though only if you are working with data that is not being created by malicious entities. (Cryptographic purposes need better hash algorithms precisely in order to deal with the "malicious attacker" problem.)
If you're using them for building a map (i.e., you're building a hash table) it's usually better to use a cheap hash and come up with a way to mitigate the costs of collision (e.g., by hanging a linked list off the hash table and resizing/rebuilding when the average weight gets too large).
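A toy Java sketch of that approach (separate chaining with a rebuild once the average chain length passes a threshold; names and thresholds are arbitrary, not production code):

    import java.util.LinkedList;

    public class ChainedTable<K, V> {
        private record Entry<K, V>(K key, V value) {}

        private static final double MAX_AVG_CHAIN = 2.0;
        private LinkedList<Entry<K, V>>[] buckets = newBuckets(16);
        private int size = 0;

        @SuppressWarnings("unchecked")
        private static <K, V> LinkedList<Entry<K, V>>[] newBuckets(int n) {
            LinkedList<Entry<K, V>>[] b = new LinkedList[n];
            for (int i = 0; i < n; i++) b[i] = new LinkedList<>();
            return b;
        }

        public void put(K key, V value) {
            if ((double) (size + 1) / buckets.length > MAX_AVG_CHAIN) resize();
            LinkedList<Entry<K, V>> chain = buckets[index(key, buckets.length)];
            if (chain.removeIf(e -> e.key().equals(key))) size--;  // replace an existing mapping
            chain.add(new Entry<>(key, value));
            size++;
        }

        public V get(K key) {
            for (Entry<K, V> e : buckets[index(key, buckets.length)]) {
                if (e.key().equals(key)) return e.value();
            }
            return null;
        }

        // Rebuild with twice as many buckets once chains get too long on average.
        private void resize() {
            LinkedList<Entry<K, V>>[] bigger = newBuckets(buckets.length * 2);
            for (LinkedList<Entry<K, V>> chain : buckets) {
                for (Entry<K, V> e : chain) bigger[index(e.key(), bigger.length)].add(e);
            }
            buckets = bigger;
        }

        private int index(Object key, int n) {
            return (key.hashCode() & 0x7fffffff) % n;   // cheap, non-cryptographic bucket choice
        }
    }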
Hashtables seem to be preferable in terms of disk access. What is the real reason that indexes are usually implemented with a tree?
Sorry if it's a naive question, but I did not find a straight answer on SO.
One of the common actions with data is to sort it or to search for data in a range - a tree keeps data in order, while a hash table is only useful for looking up a single row and has no idea what the next row is.
So hash tables are no good for this common case, for example:
SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000
or
SELECT * FROM MyTable ORDER BY x
Obviously there are cases where hash tables are better, but it's best to deal with the main cases first.
Then there is size: B-trees start small and perfectly formed and grow nicely to enormous sizes, whereas hashes have a fixed size that can be too big (10,000 buckets for 1,000 entries) or too small (10,000 buckets for 1,000,000,000 entries) for the amount of data you have.
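The same distinction shows up in application code; a small Java sketch (made-up data), with TreeMap standing in for the ordered index and the SQL range query above as the model:

    import java.util.Map;
    import java.util.TreeMap;

    public class RangeQuery {
        public static void main(String[] args) {
            // Tree-style index: keys kept in order, so a range scan is cheap.
            TreeMap<Integer, String> byVal = new TreeMap<>();
            byVal.put(9_500, "row A");
            byVal.put(10_250, "row B");
            byVal.put(11_900, "row C");
            byVal.put(15_000, "row D");

            // Equivalent of: WHERE Val BETWEEN 10000 AND 12000
            Map<Integer, String> hits = byVal.subMap(10_000, true, 12_000, true);
            System.out.println(hits); // {10250=row B, 11900=row C}

            // A hash index could only answer Val = <exact value>; the range
            // would require probing every possible value or scanning all rows.
        }
    }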
Hash tables provide no benefit for this case:
SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000
One only has to look at MySQL's hash index implementation for the MEMORY storage engine to see its disadvantages:
They can be used with equality operators such as = but not with comparison operators such as <
The optimizer cannot use a hash index to speed up ORDER BY operations.
Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
Optimizer cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use).
And note that the above applies to hash indexes implemented in memory, before even considering the disk-access issues that come with indexes implemented on disk.
Disk-access factors, as noted by #silentbicycle, would skew things in favour of the balanced-tree index even more.
Databases typically use B+ trees (a specific kind of tree), since they have better disk access properties - each node can be made the size of a filesystem block. Doing as few disk reads as possible has a greater impact on speed, since comparatively little time is spent on either chasing pointers in a tree or hashing.
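To make the block-size point concrete with assumed figures: with 4 KB nodes and roughly 16 bytes per key/child-pointer pair, each internal node fans out to about 256 children, so a three-level B+ tree covers about 256^3 ~ 16.7 million keys and a four-level tree about 4.3 billion; a point lookup then touches only three or four blocks, and the top levels are usually cached in memory anyway.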
Hashing is good when the data set is not growing - more technically, when N/n stays constant,
where N = number of elements and n = number of hash slots.
If that is not the case, hashing doesn't give a good performance gain.
In a database the data will most likely be growing at a significant pace, so using a hash index there is not a good idea.
And yes, there is the sorting requirement too.
"In database most probably the data would be increasing a significant pace so using hash there is not a good idea."
That is an over-exaggeration of the problem. Yes hash spaces must be fixed in size (modulo solutions ala extensible hashing) and yes, their size must be managed, and yes, someone must do that job.
That said, the performance gains if you exploit hash-based physical location to its fullest potential, are enormous.