Collisions in a HashTable Data Structure - c

I need help understanding hashtables, so correct me if I am wrong. A hashtable computes the index into the array from a key by encrypting the key; the resulting encryption is the index. Collisions are unavoidable because there is a high chance of two keys producing the same index, and we can use chaining to create a linked list inside each index of the array. The runtime of a hashtable is O(1).

This is essentially correct, with one terminology fix: the key is hashed, not encrypted. A hash table has O(1) average-case lookup and O(n) storage space.
A hashing function is used to compute the index at which we store the element in the hash table. When a collision occurs, we chain the colliding entries.
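
To make the chaining idea concrete, here is a minimal Java sketch of a hash table with separate chaining (illustrative only: it assumes String keys and int values, and does no resizing):

import java.util.LinkedList;

// Minimal hash table with separate chaining. Each array slot holds a
// linked list ("chain") of entries whose keys hash to that index.
public class ChainedHashTable {
    private record Entry(String key, int value) {}

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int indexFor(String key) {
        // hashCode() may be negative; mask off the sign bit first.
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(String key, int value) {
        LinkedList<Entry> chain = buckets[indexFor(key)];
        chain.removeIf(e -> e.key().equals(key)); // replace an existing key
        chain.add(new Entry(key, value));
    }

    public Integer get(String key) {
        for (Entry e : buckets[indexFor(key)]) { // walk the chain
            if (e.key().equals(key)) return e.value();
        }
        return null; // key absent
    }
}

Lookup stays O(1) on average as long as the chains stay short; in the worst case, where every key hashes to the same index, the table degrades to one long linked list and lookups become O(n).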

Related

Probability of New Hash Collision, Conditional on No Current Collisions

I'm trying to understand the probability of collision of new hashes, given no collisions in the existing hash table yet.
For illustration, let's say I have a table where I store hashes of each row.
The table currently has 1 billion rows
There are no hash collisions amongst those 1 billion rows.
I'm using a 64-bit hash algorithm.
Now imagine I insert 10 million new rows of data into the table. What is the probability that I have a hash collision now? I think the answer is the following:
Each new row's hash cannot have the same value as any of the existing rows or the new ones processed before it. That removes 1 billion hash values from the 2^64 possibilities, so the probability of new collisions should be:
Does that sound right?
Thanks to President James K. Polk, I realized that my original solution was wrong. With $N = 2^{64}$, $n = 10^9$ existing collision-free rows, and $m = 10^7$ new rows, the probability of no collisions is

$$\prod_{i=0}^{m-1} \left(1 - \frac{n+i}{N}\right)$$

Another way to think of it is just using the definition of conditional probability:

$$P(\text{no collision in } n+m \mid \text{no collision in } n) = \frac{P(\text{no collision in } n+m)}{P(\text{no collision in } n)}$$

...which reduces to...

$$\frac{\prod_{k=0}^{n+m-1}\left(1 - k/N\right)}{\prod_{k=0}^{n-1}\left(1 - k/N\right)}$$

...which can be reduced to the product formula above.
The benefit of the conditional probability formula is that it can be easily estimated using any of the online hash collision probability calculators.
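
As a quick numeric sanity check (a sketch with the question's n = 10^9 and m = 10^7 hard-coded), the product can also be evaluated stably in log space:

// Estimate P(no new collision) when m new rows are added to a table that
// already holds n collision-free 64-bit hashes:
//   P = prod_{i=0}^{m-1} (1 - (n + i)/2^64)
// log1p keeps the product of many near-1 factors numerically stable.
public class CollisionProbability {
    public static void main(String[] args) {
        double N = Math.pow(2, 64); // number of possible 64-bit hash values
        long n = 1_000_000_000L;    // existing rows, assumed collision-free
        long m = 10_000_000L;       // new rows to insert

        double logP = 0.0;
        for (long i = 0; i < m; i++) {
            logP += Math.log1p(-(n + i) / N); // log(1 - (n+i)/2^64)
        }
        System.out.printf("P(no new collision) ~ %.10f%n", Math.exp(logP));
        System.out.printf("P(at least one)     ~ %.3e%n", 1 - Math.exp(logP));
    }
}

For these numbers the collision probability comes out on the order of 5 * 10^-4.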

Is a hash index never a clustering index?

From Database System Concepts
We use the term hash index to denote hash file structures as well as
secondary hash indices. Strictly speaking, hash indices are only
secondary index structures.
A hash index is never needed as a clustering index structure, since, if a file itself is organized by hashing, there is no need for a
separate hash index structure on it. However, since hash file
organization provides the same direct access to records that indexing
provides, we pretend that a file organized by hashing also has a
clustering hash index on it.
Is "secondary index" the same concept as "nonclustering index" (which is what I understood from the book)?
Is a hash index never a clustering index or not?
Could you rephrase or explain why the reason "A hash index is never needed as a clustering index structure" is "if a file itself is organized by hashing, there is no need for a separate hash index structure on it"? What about "if a file itself is not organized by hashing"?
Thanks.
The text tries to explain something but unfortunately creates more confusion than it resolves.
At the logical level, database tables (correct term: "relations") are made up of rows (correct term: "tuples"), which represent facts about the real world the DB is meant to represent/reflect. Don't ever call those rows/tuples "records", because "records" is a concept pertaining to the physical level, which is distinct from the logical level.
Typically, though this is not a universal law cast in stone, you will find that the physical organization consists of a "main" datastore which has a record for each tuple, where that record contains each and every attribute (column) value of the tuple (row). (That's unless there are LOBs in play, or similar.) Those records must be given a physical location in the store, and this is usually/typically done using a B-tree on the primary key values. This facilitates:
retrieving only specific [tuples/rows with] primary key values from the relation/table.
traversing the [tuples of the] relation in order of primary key values.
retrieving only [tuples/rows within] specific ranges of primary key values from the relation/table.
This B-tree on the primary key values is typically called the "clustering" index.
Often, there is also a frequent need to retrieve only [tuples/rows with] specific values of attributes that are not the primary key. If that needs to be done as efficiently/fast as lookups by primary key, we use similar indexes that are then sometimes called "secondary". Those indexes typically do not contain all the attribute/column values of the indexed tuple/row, but only the attribute values being indexed plus the primary key value (so we can find the rest of the attributes in the "main" datastore).
Those "secondary" indexes will mostly also be B-tree indexes, which permit in-order traversal of the attributes being indexed, but they can also be hash indexes, which only permit looking up tuples/rows by equality comparison with a given key value ("key" here means index key, nothing to do with the keys on the relation/table; though obviously, for most keys on the table/relation there will be a dedicated index whose index key has the same attributes as the table key it supports).
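
To make that arrangement concrete, here is a toy Java sketch (all names are made up for the example): a TreeMap stands in for the B-tree "clustered" main store, and a HashMap plays a secondary hash index that holds only the indexed attribute plus primary key values:

import java.util.*;

public class SecondaryIndexDemo {
    record Employee(int id, String name, String dept) {}

    public static void main(String[] args) {
        // "Main" datastore, clustered on the primary key: a TreeMap stands
        // in for a B-tree, so in-order traversal by id works.
        NavigableMap<Integer, Employee> mainStore = new TreeMap<>();
        // Secondary hash index on dept: attribute value -> primary keys only.
        Map<String, List<Integer>> deptIndex = new HashMap<>();

        for (Employee e : List.of(new Employee(1, "Ada", "ENG"),
                                  new Employee(2, "Grace", "ENG"),
                                  new Employee(3, "Edsger", "MATH"))) {
            mainStore.put(e.id(), e);
            deptIndex.computeIfAbsent(e.dept(), k -> new ArrayList<>()).add(e.id());
        }

        // Equality lookup via the secondary index, then back to the main
        // store for the remaining attributes.
        for (int id : deptIndex.getOrDefault("ENG", List.of())) {
            System.out.println(mainStore.get(id));
        }
        // The hash index supports only this kind of equality lookup; the
        // ordered main store also supports in-order and range access, e.g.
        // mainStore.subMap(1, true, 2, true).
    }
}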
Finally, there is no theoretical reason why a "primary" (/"clustered") index could not be a hash index (the text kinda suggests the opposite but that is plain wrong). But given the poor level of the explanation in your textbook, it is probably not expected of you to be taught that.
Also note that there are still other ways to physically organize a database than just using B-tree or hash indexes.
So to sum up:
"Clustered" usually refers to the index on the primary data records store
and is usually a B-tree [or some such] on the primary key
and the textbook presumably does not want you to know about more advanced possibilities
"Secondary" usually refers to additional indexes that provide additional "fast access to specific tuples/rows"
and is usually also a B-tree that permits in-order traversal just like the "clustered"/"primary" index
but can also be a hash index that permits only "access by given value" but no in-order traversal.
Hope it helps.
I will try to oversimplify just to point out where your confusion is.
There are different types of index organisation:
Clustered
Non-clustered
Each of them may use one of the following file structures:
Sequential file organisation
Hash file organisation
We can have clustered indexes and non clustered indexes using hash file organisations.
Your text book is supposing that clustered indexes are used only on primary keys.
It also supposes that hash indexes, by which I suppose it means non-clustered indexes using hash file organisation, are only used for secondary indexes (non-primary-key fields).
But you can actually have clustered indexes on primary keys and non-primary keys. Maybe it is a simplification done for the sake of comprehension, or it is based on a specific implementation of a DB.

Hash join performance origin

Wikipedia says:
First prepare a hash table of the smaller relation. The hash table
entries consist of the join attribute and its row. Because the hash
table is accessed by applying a hash function to the join attribute,
it will be much quicker to find a given join attribute's rows by using
this table than by scanning the original relation.
It appears as if the speed of this join algorithm is due to the fact that we hash R (the smaller relation) but not S (the other, larger one).
My question is: how do we compare the hashed versions of R's rows to S without running the hash function on S as well? Do we presume the DB stores one for us?
Or am I wrong in assuming that S is not hashed, and the speed advantage is due to comparing hashes (unique, small) as opposed to reading through the actual data of the rows (not unique, possibly large)?
The hash function will also be used on the join attribute in S.
I think the meaning of the quoted paragraph is that applying the hash function to the attribute, finding the correct hash bucket, and following the linked list will be faster than searching for the corresponding row of table R with a table or index scan.
The trade-off for this speed gain is the cost of building the hash.
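
As a sketch of the mechanics (illustrative types and data, not any particular DBMS's implementation): both phases hash the join attribute, but the hash table is built only over the smaller relation R:

import java.util.*;

public class HashJoinDemo {
    record Row(int key, String payload) {}

    static List<String> hashJoin(List<Row> r, List<Row> s) {
        // Build phase, O(|R|): this is the cost mentioned above.
        Map<Integer, List<Row>> buckets = new HashMap<>();
        for (Row row : r) {
            buckets.computeIfAbsent(row.key(), k -> new ArrayList<>()).add(row);
        }
        // Probe phase, O(|S|) expected: each probe hashes the S-side key too.
        List<String> out = new ArrayList<>();
        for (Row sRow : s) {
            for (Row rRow : buckets.getOrDefault(sRow.key(), List.of())) {
                out.add(rRow.payload() + " | " + sRow.payload());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> r = List.of(new Row(1, "r1"), new Row(2, "r2"));
        List<Row> s = List.of(new Row(2, "s1"), new Row(2, "s2"), new Row(3, "s3"));
        hashJoin(r, s).forEach(System.out::println); // r2 | s1, r2 | s2
    }
}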

Easiest way to hash set of integers?

I have a set of ~2000 monotonic large integers (32-bit) which must serve as keys to a hash table. How can I take advantage of this constraint to efficiently hash them?
How can I take advantage of this constraint (monotonic) to efficiently hash them?
That the keys are sorted (monotonic) is unlikely to aid in hashing, as hashing, in general, attempts to defeat the ordering of keys.
Hashing chops up any key in a seemingly non-ordered fashion.
Not only do keys and related data need to be added to the hash table; accesses (simple reads) to the hash table come through keys that are certainly not sorted.
If the original keys are sorted and access is sequential, then a hash table should not be used in the first place.
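
For completeness: if you do want to hash such keys anyway, one standard choice (a general technique, not something the answer above prescribes) is multiplicative, a.k.a. Fibonacci, hashing, which deliberately scatters ordered keys:

public class IntHashDemo {
    // Multiplicative hashing for 32-bit integer keys. 0x9E3779B1 is roughly
    // 2^32 divided by the golden ratio; the multiply mixes the bits and the
    // unsigned shift keeps the high bits, which are the best mixed.
    static int hash(int key, int tableBits) {
        return (key * 0x9E3779B1) >>> (32 - tableBits);
    }

    public static void main(String[] args) {
        int tableBits = 12; // 2^12 = 4096 buckets
        for (int key = 1000; key < 1005; key++) {
            // Consecutive (monotonic) keys scatter across the table.
            System.out.println(key + " -> bucket " + hash(key, tableBits));
        }
    }
}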

Is there a difference between an array of Linked Lists and a Hash table?

My question is pretty much what's up above. According to the diagrams I see of hash tables, it seems like, for example:
Hashtable temp1 = new Hashtable(20);
SList[] temp2 = new SList[20];
Assuming SList is a singly linked list, isn't temp1 almost the same as temp2? I'm just trying to understand hash tables a little better. Thanks!
With a hash table you insert a value V into a table with a key K. This K can be anything that can be processed by a hash function.
Insertion into SList[], however, must be done using an integer index, which is far less flexible than being able to use any hashable object.
Besides this point, there are several ways of dealing with collisions on a key in a hash table. Hashing with chaining produces something which resembles your aforementioned array of linked lists, with the added benefit of indexing by hashable objects instead of ints, as sketched below.
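
A tiny sketch of the difference (hypothetical key and table size): the hash table's extra step is exactly the key-to-index computation that a bare SList[] leaves to you:

public class BucketIndexDemo {
    public static void main(String[] args) {
        int capacity = 20;    // same size as temp1/temp2 above
        String key = "alice"; // any object with a usable hashCode() works
        // What the hash table does for you: map the key to a slot.
        int bucket = (key.hashCode() & 0x7fffffff) % capacity;
        System.out.println("\"" + key + "\" lives in chain " + bucket);
        // temp2[bucket] must be addressed with this int directly, while
        // temp1.put(key, value) computes it for you and chains collisions.
    }
}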
See here under Hashing for a description of other ways of dealing with hash collisions in a hash table.
