I have a set of ~2000 monotonic large integers (32-bit) which must serve as keys to a hash table. How can I take advantage of this constraint to efficiently hash them?
How can I take advantage of this constraint (monotonic) to efficiently hash them?
The fact that the keys are sorted (monotonic) is unlikely to help with hashing, because hashing, in general, attempts to defeat the ordering of keys.
Hashing chops up any key in a seemingly non-ordered fashion.
Not only must keys and their related data be added to the hash table; access (plain reads) to the hash table also goes through keys, and those lookups certainly do not arrive in sorted order.
If original keys are sorted and access is sequential, then a hash table should not be used in the first place.
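If the keys really are sorted and mostly read in order, a sorted array may be all that is needed. The following is only a sketch under that assumption: the sample keys, the `values` list and the `lookup` helper are made up for illustration, and binary search stands in for the hash table.

```python
from bisect import bisect_left

# Hypothetical data: 2000 sorted 32-bit keys and one payload per key.
keys = list(range(1_000_000, 1_000_000 + 2000 * 7, 7))
values = [f"payload-{k}" for k in keys]

def lookup(key):
    """Binary search over the sorted keys; returns None when the key is absent."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

# Sequential access is a plain in-order walk, with no hashing involved.
first_three = list(zip(keys, values))[:3]
```

With about 2000 keys a point lookup costs roughly 11 comparisons, which is why the sorted layout can be competitive when most access is sequential anyway.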
According to RDBMS theory, when choosing a primary key, we are supposed to choose among minimal superkeys, effectively optimizing our key choice w.r.t. the number of columns.
Why are we interested in optimizing against # of columns instead of number of bytes? Wouldn't a smaller byte size PK result in smaller index tables and overall more read/write time efficient queries? For example, choosing a PK comprised of 2 varchar(16) rather than 1 varchar(64).
I think I agree with you.
I don't think theory accounts for physical storage.
Yes. If, for instance, you created a column holding a SHA256 of two small columns, say VARCHAR(16) each, then the nodes of the B-tree in the index would take up more space, and the index would not be faster than indexing the two 16-byte columns.
Some efficiency is lost when an index matches on the first column and then has to switch to comparisons on the second column. B-tree nodes are more efficient when the whole node compares on the same column.
Honestly though, I don't think either amounts to much difference in efficiency. I think the statement is simply RDBMS theory not accounting for storage size.
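To put rough numbers on the size argument, here is a small sketch. The two column values and the `\x00` separator are invented for the example, and it measures only raw byte counts, not how any particular engine lays out index nodes.

```python
import hashlib

# Hypothetical two-column key; each value fits in a VARCHAR(16).
col_a, col_b = "ACME-2024", "invoice-000123"

# Bytes needed to store the two column values themselves.
composite_bytes = len(col_a.encode("utf-8")) + len(col_b.encode("utf-8"))

# A SHA-256 digest is always 32 raw bytes (64 characters when stored as hex).
digest_bytes = hashlib.sha256((col_a + "\x00" + col_b).encode("utf-8")).digest_size
hex_chars = len(hashlib.sha256(b"").hexdigest())
```

Here the two values together take 23 bytes, while the digest takes 32 bytes no matter how short the inputs are.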
The identification of minimal rather than non-minimal superkeys is very important when defining keys in a database. If you choose to enforce uniqueness on three columns, A,B,C then that's very different from enforcing uniqueness on just two columns, A,B. A uniqueness constraint on A,B,C would not guarantee the uniqueness of A,B - so A,B would no longer be a superkey. On the other hand if the uniqueness constraint is on A,B then A,B,C is also a superkey. So it's essential from a data integrity point of view to know what the irreducible set of superkeys is.
This has nothing to do with primary keys as such because all keys must be minimal, not just the one you choose to call primary. Storage size and performance are something else. Internal storage is an important consideration in the design of indexes but size and performance are non-functional requirements whereas keys are all about logic and functionality.
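The point about A,B versus A,B,C can be checked on a toy table. This is a sketch only: the rows and the `is_unique_on` helper are hypothetical and stand in for a real uniqueness constraint.

```python
def is_unique_on(rows, columns):
    """True when no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

# Hypothetical rows: unique on (A, B, C) but with a repeated (A, B) pair.
rows = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
]

unique_abc = is_unique_on(rows, ["A", "B", "C"])
unique_ab = is_unique_on(rows, ["A", "B"])
```

These rows satisfy a constraint on A,B,C while violating one on A,B, so enforcing only the wider constraint would let the duplicate (1, "x") pair in.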
From Database System Concepts
We use the term hash index to denote hash file structures as well as
secondary hash indices. Strictly speaking, hash indices are only
secondary index structures.
A hash index is never needed as a clustering index structure, since, if a file itself is organized by hashing, there is no need for a
separate hash index structure on it. However, since hash file
organization provides the same direct access to records that indexing
provides, we pretend that a file organized by hashing also has a
clustering hash index on it.
Is "secondary index" the same concept as "nonclustering index" (which is what I understood from the book)?
Is a hash index never a clustering index or not?
Could you rephrase or explain why the reason "A hash index is never needed as a clustering index structure" is "if a file itself is organized by hashing, there is no need for a separate hash index structure on it"? What about "if a file itself is not organized by hashing"?
Thanks.
The text tries to explain something but unfortunately creates more confusion than it resolves.
At the logical level, database tables (correct term : "relations") are made up of rows (correct term : "tuples") which represent facts about the real world the db is aimed to represent/reflect. Don't ever call those rows/tuples "records" because "records" is a concept pertaining to the physical level, which is distinct from the logical.
Typically, but this is not a universal law cast in stone, you will find that the physical organization consists of a "main" datastore which has a record for each tuple and where that record contains each and every attribute (column) value of the tuple (row). (That's unless there are LOBs in play or so.) Those records must be given a physical location in the store they are stored in and this is usually/typically done using a B-tree on the primary key values. This facilitates :
retrieving only specific [tuples/rows with] primary key values from the relation/table.
traversing the [tuples of] relation in-order of primary key values
retrieving only [tuples/rows within] specific ranges of primary key values from the relation/table.
This B-tree on the primary key values is typically called the "clustering" index.
Often, there is also a frequent need for retrieving only [tuples/rows with] specific values of attributes that are not the primary key. If that needs to be done as efficiently/fast as it can be for values of the primary key, we use similar indexes that are then sometimes called "secondary". Those indexes typically do not contain all the attribute/column values of the tuple/row indexed, but only the attribute values to be indexed plus a mention of the primary key value (so we can find the rest of the attributes in the "main" datastore).
Those "secondary" indexes will mostly also be B-tree indexes, which permit in-order traversal for the attributes being indexed, but they can potentially also be hashing indexes, which only permit looking up tuples/rows using equality comparisons with a given key value ("key" = index key, nothing to do with the keys on the relation/table, though obviously for most keys on the table/relation, there will be a dedicated index too where the index key has the same attributes as the table key it supports).
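The difference between the two kinds of secondary index can be sketched in a few lines. Everything here is a stand-in: a sorted list plays the role of the B-tree, a dict plays the role of the hash index, and the table contents and function names are invented.

```python
from bisect import bisect_left, bisect_right

# "Main" datastore: one record per tuple, keyed by primary key (hypothetical data).
main_store = {
    1: {"name": "Ann", "city": "Oslo"},
    2: {"name": "Bob", "city": "Bergen"},
    3: {"name": "Cid", "city": "Oslo"},
    4: {"name": "Dee", "city": "Tromso"},
}

# Ordered secondary index on city: sorted (city, primary key) pairs.
ordered_index = sorted((rec["city"], pk) for pk, rec in main_store.items())

# Hash secondary index on city: city -> list of primary keys.
hash_index = {}
for pk, rec in main_store.items():
    hash_index.setdefault(rec["city"], []).append(pk)

def range_lookup(low, high):
    """Primary keys whose city satisfies low <= city <= high, in city order."""
    start = bisect_left(ordered_index, (low,))
    stop = bisect_right(ordered_index, (high, float("inf")))
    return [pk for _, pk in ordered_index[start:stop]]

def equality_lookup(city):
    """Primary keys for exactly this city; the hash index offers nothing more."""
    return hash_index.get(city, [])
```

Both indexes store only the indexed attribute plus the primary key, so the remaining attributes are fetched from `main_store`. Only the ordered one can answer a range such as `range_lookup("Bergen", "Oslo")`.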
Finally, there is no theoretical reason why a "primary" (/"clustered") index could not be a hash index (the text kinda suggests the opposite but that is plain wrong). But given the poor level of the explanation in your textbook, it is probably not expected of you to be taught that.
Also note that there are still other ways to physically organize a database than just using B-tree or hash indexes.
So to sum up :
"Clustered" usually refers to the index on the primary data records store
and is usually a B-tree [or some such] on the primary key
and the textbook presumably does not want you to know about more advanced possibilities
"Secondary" usually refers to additional indexes that provide additional "fast access to specific tuples/rows"
and is usually also a B-tree that permits in-order traversal just like the "clustered"/"primary" index
but can also be a hash index that permits only "access by given value" but no in-order traversal.
Hope it helps.
I will try to oversimplify, just to point out where your confusion is.
There are different types of index organisation:
Clustered
Non Clustered
Each of them may use one of the following file structures:
Sequential File organisation
Hash file organisation
We can have clustered indexes and non clustered indexes using hash file organisations.
Your textbook is supposing that clustered indexes are used only on primary keys.
It also supposes that hash indexes, which I suppose is referring to a non-clustered index using hash file organisation, are only used for secondary indexes (non primary-key fields).
But you can actually have clustered indexes on primary keys and non-primary keys. Maybe it is a simplification done for the sake of comprehension, or it is based on a specific implementation of a DB.
Wikipedia says:
First prepare a hash table of the smaller relation. The hash table
entries consist of the join attribute and its row. Because the hash
table is accessed by applying a hash function to the join attribute,
it will be much quicker to find a given join attribute's rows by using
this table than by scanning the original relation.
It appears as if the speed of this join algorithm comes from hashing R (the smaller relation) but not S (the other, larger one).
My question is: how do we compare hashed versions of R's rows to S without running the hash function on S as well? Do we presume the DB stores one for us?
Or am I wrong to assume that S is not hashed, and the speed advantage comes from comparing hashes (unique, small) as opposed to reading through the actual data of the rows (not unique, possibly large)?
The hash function will also be used on the join attribute in S.
I think that the meaning of the quoted paragraph is that applying the hash function on the attribute, finding the correct hash bucket and following the linked list will be faster than searching for the corresponding row of the table R with a table or index scan.
The trade-off for this speed gain is the cost of building the hash.
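A bare-bones version of the build and probe phases might look like the following. It is a sketch, not how any particular database implements it: relations are plain lists of dicts, Python's built-in dict supplies the hash function and buckets, and the sample relations `R` and `S` are invented.

```python
from collections import defaultdict

def hash_join(small, large, key):
    """Join two lists of dict rows on `key`, building the hash table on `small`."""
    # Build phase: hash every row of the smaller relation once.
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)

    # Probe phase: the same hash function is applied to each row of the larger
    # relation (Python's dict does this internally on lookup).
    joined = []
    for row in large:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical relations R (smaller) and S (larger).
R = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "Ops"}]
S = [
    {"emp": "Ann", "dept_id": 1},
    {"emp": "Bob", "dept_id": 2},
    {"emp": "Cid", "dept_id": 1},
    {"emp": "Dee", "dept_id": 3},
]

result = hash_join(R, S, "dept_id")
```

Each row of S is hashed exactly once during the probe, so the whole join takes one pass over R plus one pass over S instead of rescanning R for every row of S.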
I have a table on which I need a unique constraint across multiple columns. Instead of creating a multi-column unique index, I could also introduce an extra column based on a hash of all the required fields. Which one will be more effective in terms of database performance?
MySQL suggests the hashed column method but I couldn't find any information regarding SQL Server.
The link you give states:
If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns.
So the performance improvement really relies on the indexed hash being quite a bit smaller than the combined multiple columns. This could easily not be the case, given that an MD5 is 16 bytes. I'd consider how much wider the average index key would be for the multi-column index, and to be honest I'd probably not bother with the hash anyway.
You could, if you feel inclined, benchmark your system with both approaches. And if the potential benefits don't tempt you into trying that, again I'd not bother.
I've used the technique more often for change detection, where checking for a change in 100 separate columns of a table row is much more compute intensive than comparing two hashes.
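The change-detection use can be sketched as follows. The column names, the `row_digest` helper and the separator byte are assumptions made for the example; MD5 is used only because it is the hash mentioned above.

```python
import hashlib

def row_digest(row, columns):
    """MD5 over the listed column values, fed in a fixed column order."""
    h = hashlib.md5()
    for col in columns:
        h.update(repr(row[col]).encode("utf-8"))
        h.update(b"\x1f")  # field separator between column values
    return h.digest()

# Hypothetical wide row: 100 columns named c0 .. c99.
columns = [f"c{i}" for i in range(100)]
stored = {c: i for i, c in enumerate(columns)}
stored_digest = row_digest(stored, columns)

# An incoming copy of the row with a single column changed.
incoming = dict(stored)
incoming["c42"] = -1

changed = row_digest(incoming, columns) != stored_digest
```

The stored digest is computed once per row, so later checks compare two 16-byte values instead of 100 column pairs. A collision could in principle hide a change, which is usually accepted for this purpose.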
I need to create a hash key on my tables for uniqueness, and someone mentioned md5 to me. But I have read about checksum and binary sum; would these not serve the same purpose? The goal is to ensure no duplicates in a specific field.
Now I have managed to implement this and I see the hash keys in my tables.
Do I need to alter index keys originally created since I created a new index key with these hash keys? Also do I need to change the keys?
How do I change my queries for example SELECT statements?
I guess I am still unsure how hash keys really help in queries other than uniqueness?
If your goal is to ensure no duplicates in a specific field, why not just apply a unique index to that field and let the database engine do what it was meant to do?
It makes no sense to write a unique function to replace SQL Server unique constraints/indexes.
How are you going to ensure the hash is unique? With a constraint?
If you index it (which may not be allowed because of determinism), then the optimiser will treat it as non-unique, as well as killing performance.
And you only have a few 100,000 rows. Peanuts.
Given time I could come up with more arguments, but I'll summarise: don't do it.
There's always the HashBytes() function. It supports md5, but if you don't like it there's an option for sha1.
As for how this can help queries: one simple example is if you have a large varchar column — maybe varchar max — and in your query you want to know if the contents of this column match a particular string. If you have to compare your search with every single record it could be slow. But if you hash your search string and use that, things can go much faster since now it's just a very short binary compare.
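Here is a sketch of that lookup pattern. It uses an in-memory SQLite database from Python rather than SQL Server, and `hashlib.md5` plays the role that HashBytes would play there; the table, column and function names are all invented for the example.

```python
import hashlib
import sqlite3

def digest(text):
    """16-byte MD5 of the text, stored next to the long column."""
    return hashlib.md5(text.encode("utf-8")).digest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, body_hash BLOB)")
conn.execute("CREATE INDEX idx_docs_body_hash ON docs (body_hash)")

bodies = ["first long document " * 50, "second long document " * 50]
conn.executemany(
    "INSERT INTO docs (body, body_hash) VALUES (?, ?)",
    [(b, digest(b)) for b in bodies],
)

def find_exact(text):
    """Ids of rows whose body equals text: match on the hash, then confirm the body."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE body_hash = ? AND body = ?",
        (digest(text), text),
    )
    return [r[0] for r in rows]
```

The extra `body = ?` condition guards against two different strings sharing a hash, and the index on `body_hash` means only the few rows with a matching digest need the long comparison.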
Cryptographically safe hash functions are one-way functions, and they consume more resources (CPU cycles) than functions that are not cryptographically secure. If you just need a function to produce a hash key, you do not need that property. All you need is a low probability of collisions, which is related to uniformity. Try CRC if you have strings, or modulo for numbers.
http://en.wikipedia.org/wiki/Hash_function
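As a rough illustration of the two cheap options mentioned, here is a sketch; the bucket count and function names are arbitrary choices made for the example.

```python
import zlib

NUM_BUCKETS = 1024  # hypothetical table size

def bucket_for_string(s):
    """CRC-32 of the UTF-8 bytes, reduced to a bucket index."""
    return zlib.crc32(s.encode("utf-8")) % NUM_BUCKETS

def bucket_for_int(n):
    """Plain modulo for integer keys."""
    return n % NUM_BUCKETS
```

Neither function is collision-free: different keys can land in the same bucket, so a table built on them still needs to compare the full key on lookup.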
Why don't you use a GUID with a default of NEWSEQUENTIALID()? Don't use NEWID(), since it is horrible for clustering; see here: Best Practice: Do not cluster on UniqueIdentifier when you use NewId
Make this column the primary key and you are pretty much done.