I'm trying to understand the probability of collision of new hashes, given no collisions in the existing hash table yet.
For illustration, let's say I have a table where I store a hash of each row.
The table currently has 1 billion rows.
There are no hash collisions amongst those 1 billion rows.
I'm using a 64-bit hash algorithm.
Now imagine I insert 10 million new rows of data into the table. What is the probability that I have a hash collision now? I think the answer is the following:
Each new row's hash cannot have the same value as any of the existing rows or the new rows processed before it. That removes 1 billion hash values from the 2^64 possibilities, and that is what I based my probability of new collisions on.
Does that sound right?
Thanks to President James K. Polk, I realized that my original solution was wrong. The probability of no collisions is
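(in my notation, with N = 2^64 possible hash values, e = 10^9 existing collision-free hashes, n = 10^7 new rows, and assuming uniform hashing):

$$P(\text{no collision}) = \prod_{i=0}^{n-1} \frac{N - e - i}{N}$$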
Another way to think of it is just using the definition of conditional probability: take the probability that all 1.01 billion hashes are collision-free and divide it by the probability that the original 1 billion are collision-free, which reduces to the same product formula.
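Spelled out with the same N, e, and n (my notation, under the same uniformity assumption):

$$P(\text{no collision in } e+n \text{ hashes} \mid \text{no collision in } e \text{ hashes})
= \frac{\prod_{i=0}^{e+n-1} \frac{N-i}{N}}{\prod_{i=0}^{e-1} \frac{N-i}{N}}
= \prod_{i=e}^{e+n-1} \frac{N-i}{N},$$

which is exactly the product above.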
The benefit of the conditional probability formula is that it can be easily estimated using any of the online hash collision probability calculators.
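If you'd rather sanity-check the number yourself, here is a small Python sketch (the constants come from the question; the variable names and the approximation at the end are mine) that evaluates the same product in log space:

```python
import math

N = 2**64          # possible 64-bit hash values
e = 1_000_000_000  # existing rows, known to be collision-free
n = 10_000_000     # new rows being inserted

# Sum the logs of the per-row "no collision" factors from the product formula.
# (Looping 10 million times takes a couple of seconds.)
log_p_no_collision = sum(math.log1p(-(e + i) / N) for i in range(n))
print(-math.expm1(log_p_no_collision))   # P(at least one collision), about 5.4e-4

# Rough cross-check: expected number of colliding pairs involving the new rows.
print((n * e + n * (n - 1) / 2) / N)     # also about 5.4e-4
```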
Wikipedia says:
First prepare a hash table of the smaller relation. The hash table
entries consist of the join attribute and its row. Because the hash
table is accessed by applying a hash function to the join attribute,
it will be much quicker to find a given join attribute's rows by using
this table than by scanning the original relation.
It appears as if the speed of this join algorithm comes from hashing R (the smaller relation) but not S (the other, larger one).
My question is: how do we compare the hashed versions of R's rows to S without running the hash function on S as well? Do we presume the DB stores one for us?
Or am I wrong in assuming that S is not hashed, and the speed advantage is due to comparing hashes (unique, small) as opposed to reading through the actual row data (not unique, possibly large)?
The hash function will also be used on the join attribute in S.
I think the meaning of the quoted paragraph is that applying the hash function to the attribute, finding the correct hash bucket, and following the linked list will be faster than finding the corresponding rows of table R with a table or index scan.
The trade-off for this speed gain is the cost of building the hash.
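As a rough illustration (a toy Python sketch, not how a real database engine stores or pages rows; the relation and column names are made up), the build phase hashes R's join attribute and the probe phase applies the same hash to S's join attribute to find the matching bucket:

```python
from collections import defaultdict

def hash_join(r_rows, s_rows, key):
    # Build phase: hash table over the smaller relation R, keyed by the join attribute.
    buckets = defaultdict(list)
    for r in r_rows:
        buckets[r[key]].append(r)          # dict insertion hashes r[key]

    # Probe phase: the same hash function is applied to S's join attribute
    # to locate the candidate bucket; only that bucket is compared.
    for s in s_rows:
        for r in buckets.get(s[key], []):  # dict lookup hashes s[key]
            yield {**r, **s}

r = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
s = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
print(list(hash_join(r, s, "id")))
```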
I have a set of ~2000 monotonic large integers (32-bit) which must serve as keys to a hash table. How can I take advantage of this constraint (monotonic) to efficiently hash them?
That the keys are sorted (monotonic) is unlikely to aid in hashing, as hashing, in general, attempts to defeat any ordering of the keys.
Hashing chops up any key in a seemingly non-ordered fashion.
Not only do keys and related data need to be added to a hash table; access (simple reads) to the hash table is done through keys which are certainly not sorted.
If the original keys are sorted and access is sequential, then a hash table should not be used in the first place.
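If the sorted order really can be exploited directly, a plain sorted array with binary search is one sketch of that alternative (the keys and payload values below are made up):

```python
from bisect import bisect_left

# ~2000 monotonically increasing 32-bit keys (made-up example data).
keys = list(range(100_000, 100_000 + 2000 * 37, 37))
values = [k * 2 for k in keys]            # payload associated with each key

def lookup(key):
    """Binary search on the sorted keys: about 11 comparisons for 2000 entries."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

print(lookup(100_037))   # hit  -> 200074
print(lookup(100_038))   # miss -> None
```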
I have a table that needs a unique constraint across multiple columns. Instead of creating a multi-column unique index, I could also introduce an extra column based on a hash of all the required fields. Which one will be more effective in terms of database performance?
MySQL suggests the hashed-column method, but I couldn't find any information regarding SQL Server.
The link you give states:
If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns.
So the performance improvement really relies on the indexed hash being quite a bit smaller than the combined multiple columns. This could easily not be the case, given that an MD5 is 16 bytes. I'd consider how much wider the average index key would be for the multi-column index, and to be honest I'd probably not bother with the hash anyway.
You could, if you feel inclined, benchmark your system with both approaches. And if the potential benefits don't tempt you into trying that, again I'd not bother.
I've used the technique more often for change detection, where checking for a change in 100 separate columns of a table row is much more compute intensive than comparing two hashes.
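For what it's worth, the change-detection idea boils down to something like this sketch (the column values and the MD5 choice are just illustrative; a real version needs an encoding that guarantees two different rows can't serialize to the same bytes):

```python
import hashlib

def row_hash(row):
    """Fold all column values of a row into one 16-byte MD5 digest."""
    h = hashlib.md5()
    for value in row:
        h.update(repr(value).encode("utf-8"))
        h.update(b"\x1f")                 # separator so columns can't run together
    return h.digest()

old_row = ("alice", "alice@example.com", 42)
new_row = ("alice", "alice@example.com", 43)

# One 16-byte comparison instead of checking every column individually.
print(row_hash(old_row) != row_hash(new_row))   # True -> the row changed
```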
Is there any reason to put an index on a column that is commonly used in a WHERE clause or a JOIN, on a table that has fewer than 1000 rows? As a standard for a project we're working on, I am being asked to apply an index on all columns used in a WHERE.
I can understand the use of this on large tables; however, the overhead of an index on smaller tables seems useless. Is there any benefit or detriment to adding indexes willy-nilly?
Indices are there to avoid table scans, which lead to locks. Even a 1000-row table will get a lot less throughput without an index, and when no index is available for joins, this will lead to certain constructs being favoured that you will not like in loops.
It is correct that a database index matters most for large tables. But when your small table is joined with another, bigger table, the join might be slow without indexes. It is good practice to create at least one unique clustered index on the small table.
It also depends on what your search is: text search and joins are slower than with numbers.
My advice is to create one primary-key clustered index on the small table; indexes on the other columns are not required if you are sure they will be used infrequently.
I'm trying to choose between these query plans for a range query:
Sequential table scan
Bitmap index
B+ tree index
Hash index
My instinct is that a bitmap index would work here based on what I've read. Does that sound right?
This link has a pretty good explanation: http://dylanwan.wordpress.com/2008/02/01/bitmap-index-when-to-use-it/
And of course wikipedia: http://en.wikipedia.org/wiki/Bitmap_index
In short, it depends on the percentage of unique values to the total number of rows. If you have only a few unique values, the bitmap index is probably the way to go.
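To make the "few unique values" point concrete, here is a toy sketch of how a bitmap index can answer such a predicate: one bitmap per distinct value (a Python int used as a bit set, with made-up data), and the predicate becomes a bitwise OR of bitmaps:

```python
# Low-cardinality column, one entry per row id (made-up data).
rows = ["low", "low", "medium", "high", "low", "medium", "high", "high"]

# One bitmap per distinct value; bit i is set if row i has that value.
bitmaps = {}
for row_id, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

def matching_rows(values):
    """Answer a predicate like value IN ('medium', 'high') -- which is what a
    range over a low-cardinality column boils down to -- by OR-ing bitmaps."""
    combined = 0
    for v in values:
        combined |= bitmaps.get(v, 0)
    return [i for i in range(len(rows)) if (combined >> i) & 1]

print(matching_rows(["medium", "high"]))   # row ids [2, 3, 5, 6, 7]
```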