Suppose a database table has a column "Name" which is defined as key for the table.
Usual name values will be "Bill", "Elizabeth", "Bob", "Alice". Lookups on the table will be done by the name key as well.
Does hashing the values optimize the operations in any way? I.e., storing each name as some hashed value of the name (say MD5, a 128-bit hash usually written as 32 hex characters).
If so - shouldn't this be a feature of the database and not something the client handles?
Assuming your database generates an index for the primary key (and I can't imagine it wouldn't) it's doing it for you. So yes, it should absolutely be something that the database handles.
"Does hashing the values optimize the operations in any way? " Not really.
Hashes are one-way. You can't do a table scan and reconstruct the original name.
If you want to keep both name and hash-of-name, you've broken a fundamental rule by including derived data. Now a name update requires a hash update.
The "spread my values around evenly" that would happen with a hash, is the job of an index.
No, don't hash them. Your database will build an index based on the data, and hashing won't help. The only time it might help is if your key values were much longer than the hash.
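For illustration, a minimal sketch (the table and column names here are made up): declaring the name column as the primary key is all it takes for the engine to build and use an index, with no client-side hashing involved.

    CREATE TABLE People (
        Name varchar(100) NOT NULL PRIMARY KEY  -- the engine indexes this for you
    );

    -- This lookup is served by that index; no hashed values needed.
    SELECT Name FROM People WHERE Name = 'Alice';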
Is there any benefit to storing content alphabetically in columns? Maybe it makes lookups faster? If so, then when I add new lookup values to my tables, do I need to rebuild the PK so the new text fits in order? Say a table like this:
City_tbl
city_id: example: 1120
City_name: example: New York.
If I need to add Chicago to it, do I add it at the bottom of the list with the next ID, which may be 2000, or do I insert it in alphabetical order, which would mean incrementing the PK of every following row by 1?
The only benefit I know of is that when I have to manually add lookup values without querying the database, I can quickly scan the lookup value list for existing items. But I'm not sure whether it makes lookups faster if the system knows the text is in alphabetical order.
No, I see no value in it. Better to use a proper primary key and add an index to the column. The people who have spent years writing relational databases know how to optimize access far better than you do.
I'd make the PK column auto increment, leaving the updating to the database. I'd add an index to the city name column so you can search by name as quickly as possible.
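A rough sketch of that setup (MySQL syntax assumed, using the table and column names from the question):

    CREATE TABLE City_tbl (
        city_id   INT AUTO_INCREMENT PRIMARY KEY,   -- the database assigns the next id
        City_name VARCHAR(100) NOT NULL,
        INDEX idx_city_name (City_name)             -- fast lookups by name
    );

    -- New cities simply get the next id; no renumbering of existing rows.
    INSERT INTO City_tbl (City_name) VALUES ('Chicago');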
You're presuming that you understand something about the physical storage of the database. At best, your efforts will have no effect; at worst, you'll screw up the fast access that a properly indexed b-tree will already give you.
I have been using Git a lot recently and I quite like how Git avoids duplicating similar data by using a hashing function based on SHA-1. I was wondering whether current databases do something similar, or is this inefficient for some reason?
There is no need for this. Databases already have a good way of avoiding duplicating data - database normalization.
For example, imagine you have a column that can contain one of five different strings. Instead of storing one of these strings in each row, you should move the strings out into a separate table. Create a table with two columns: one holding the string values and the other a primary key. You can now use a foreign key in your original table instead of storing the whole string.
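A sketch of that normalization, with hypothetical table and column names:

    CREATE TABLE Status (
        status_id   INT PRIMARY KEY,
        status_name VARCHAR(50) NOT NULL UNIQUE   -- the five possible strings live here, once each
    );

    CREATE TABLE Orders (
        order_id  INT PRIMARY KEY,
        status_id INT NOT NULL,
        FOREIGN KEY (status_id) REFERENCES Status (status_id)  -- store a small key, not the string
    );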
I came up with a nice "reuse-based-on-hash" technique (it's probably widely used though)
I computed the hash-code of all fields in the row, and then I used this hash-code as primary key.
When I inserted, I simply did "INSERT IGNORE" (to suppress errors about duplicate primary keys). Either way, I could be sure that what I wanted to insert was present in the database afterwards.
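Roughly what that looks like in MySQL (the table, columns, and the choice of MD5 here are just illustrative):

    CREATE TABLE Snapshot (
        row_hash CHAR(32) PRIMARY KEY,   -- hash of all the fields, used as the key
        field_a  VARCHAR(100),
        field_b  VARCHAR(100)
    );

    -- Duplicate rows produce the same hash, so IGNORE silently skips them.
    INSERT IGNORE INTO Snapshot (row_hash, field_a, field_b)
    VALUES (MD5(CONCAT('foo', '|', 'bar')), 'foo', 'bar');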
If this is a known concept I'd be glad to hear about it!
I need to create a hash key on my tables for uniqueness and someone mentioned to me about md5. But I have read about checksum and binary sum; would this not serve the same purpose? To ensure no duplicates in a specific field.
Now I managed to implement this and I see the hash keys in my tables.
Do I need to alter the index keys originally created, since I created a new index on these hash keys? Also, do I need to change the keys?
How do I change my queries for example SELECT statements?
I guess I am still unsure how hash keys really help in queries other than uniqueness?
If your goal is to ensure no duplicates in a specific field, why not just apply a unique index to that field and let the database engine do what it was meant to do?
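In other words (the table and column names are hypothetical), something as simple as:

    -- A unique constraint does the duplicate checking for you.
    ALTER TABLE Customers
        ADD CONSTRAINT UQ_Customers_Email UNIQUE (Email);

    -- Or, equivalently for this purpose, a unique index:
    CREATE UNIQUE INDEX IX_Customers_Email ON Customers (Email);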
It makes no sense to write a unique function to replace SQL Server unique constraints/indexes.
How are you going to ensure the hash is unique? With a constraint?
If you index it (which may not be allowed because of determinism), then the optimiser will treat it as non-unique. As well as killing performance.
And you only have a few 100,000 rows. Peanuts.
Given time I could come up with more arguments, but I'll summarise: Don't do it
There's always the HashBytes() function. It supports md5, but if you don't like it there's an option for sha1.
As for how this can help queries: one simple example is if you have a large varchar column (maybe VARCHAR(MAX)) and in your query you want to know whether the contents of that column match a particular string. If you have to compare your search string with every single record, it could be slow. But if you store a hash of the column and compare it against a hash of your search string, things can go much faster, since it's now just a very short binary compare (followed by a check of the full value to rule out collisions).
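One way to set that up in SQL Server (table and column names are hypothetical, and note that older versions cap the HASHBYTES input at 8000 bytes):

    CREATE TABLE Documents (
        doc_id   INT IDENTITY PRIMARY KEY,
        body     VARCHAR(MAX) NOT NULL,
        -- derived hash column; HASHBYTES is deterministic, so it can be persisted and indexed
        body_md5 AS CAST(HASHBYTES('MD5', body) AS BINARY(16)) PERSISTED
    );

    CREATE INDEX IX_Documents_body_md5 ON Documents (body_md5);

    -- Compare the short hashes first, then confirm against the full text to rule out collisions.
    DECLARE @search VARCHAR(MAX) = 'some long text to look up';
    SELECT doc_id
    FROM Documents
    WHERE body_md5 = CAST(HASHBYTES('MD5', @search) AS BINARY(16))
      AND body = @search;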
Cryptographically secure hash functions are one-way functions, and they consume more resources (CPU cycles) than functions that are not cryptographically secure. If you just need a hash key, you do not need that property. All you need is a low probability of collisions, which is related to uniformity. Try CRC for strings, or modulo for numbers.
http://en.wikipedia.org/wiki/Hash_function
Why don't you use a GUID with a default of NEWSEQUENTIALID()? Don't use NEWID(), since it is horrible for clustering; see here: Best Practice: Do not cluster on UniqueIdentifier when you use NewId.
Make this column the primary key and you are pretty much done.
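A sketch of that setup (the table name is made up); note that NEWSEQUENTIALID() can only appear in a DEFAULT constraint:

    CREATE TABLE Widgets (
        WidgetId uniqueidentifier NOT NULL
            CONSTRAINT DF_Widgets_WidgetId DEFAULT NEWSEQUENTIALID(),
        Name     varchar(100) NOT NULL,
        CONSTRAINT PK_Widgets PRIMARY KEY CLUSTERED (WidgetId)  -- sequential GUIDs cluster well
    );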
I'm trying to make a judgment call in implementing a small-ish SQL Server '08 database.
I'm translating an output text file of a flat-file database from an old COBOL system to the aforementioned SQL Server database. It's a database of vehicle and real estate loans, which can be uniquely identified by the combination of a Lender ID (a seven-digit number), bank account number (15 digits), and "account suffix" (two digits).
I confess I'm pretty naive when it comes to database administration (to be honest, I've not really done it up until my current position), and I'm trying to determine which of two approaches are my best option for implementing a key which will index into several other tables:
1) Identify each loan using a three-column key of the above values, or
2) Denormalize the data by implementing a "key" column which is a 24-character string combining the three values.
The denormalization is ugly, granted, but I can't anticipate update anomalies occurring, since loans can't be passed back and forth between banks or change their loan suffix. A change in those values is guaranteed to be a different account.
A compound key is more elegant, but I've read a few treatises suggesting that it's a Bad Thing.
So, which option is likely to be a better choice, and more importantly, why?
I would use an autogenerated surrogate key and then put a unique index on the natural key. This way, if the natural key changes (and it might, if say a bank got bought out by another bank), it only needs to change in one place. The most important thing when using a surrogate key is to ensure uniqueness of the natural key, if one exists, and the unique index will do that.
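Roughly what that looks like (column names are adapted from the question and the sizes are guesses):

    CREATE TABLE Loans (
        LoanId        int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- surrogate key other tables reference
        LenderId      char(7)  NOT NULL,
        AccountNumber char(15) NOT NULL,
        AccountSuffix char(2)  NOT NULL,
        CONSTRAINT UQ_Loans_NaturalKey UNIQUE (LenderId, AccountNumber, AccountSuffix)  -- natural key stays unique
    );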
If this is reference data that won't be updated often, then using the multi-part key should be fine.
If this is high-traffic transactional data, then add a surrogate key (int identity, clustered primary key) and make the three-part key an alternate key.
I would not suggest implementing option 2 at all.
I would suggest just using an auto-incrementing numeric surrogate key. Why would it need to be a mashup of the other three "key" columns?
Is there a performance gain or best practice when it comes to using unique, numeric ID fields in a database table compared to using character-based ones?
For instance, if I had two tables:
athlete
id ... 17, name ... Rickey Henderson, teamid ... 28
team
teamid ... 28, teamname ... Oakland
The athlete table, with thousands of players, would be easier to read if the teamid was, say, "OAK" or "SD" instead of "28" or "31". Let's take for granted the teamid values would remain unique and consistent in character form.
I know you CAN use characters, but is it a bad idea for indexing, filtering, etc for any reason?
Please ignore the normalization argument as these tables are more complicated than the example.
I find primary keys that are meaningless numbers cause less headaches in the long run.
Text is fine, for all the reasons you mentioned.
If the string is only a few characters, then it will be nearly as small as an integer anyway. The biggest potential drawback to using strings is the size: database performance is related to how many disk accesses are needed. Making the index twice as big, for example, could create disk-cache pressure and increase the number of disk seeks.
I'd stay away from using text as your key - what happens in the future when you want to change the team ID for some team? You'd have to cascade that key change all through your data, which is exactly what a primary key should let you avoid. Also, though I don't have any empirical evidence, I'd think the INT key would be significantly faster than the text one.
Perhaps you can create views for your data that make it easier to consume, while still using a numeric primary key.
I'm just going to roll with your example. Doug is correct when he says that text is fine. Even for a medium sized (~50gig) database having a 3 letter code be a primary key won't kill the database. If it makes development easier, reduces joins on the other table and it's a field that users would be typing in...I say go for it. Don't do it if it's just an abbreviation that you show on a page or because it makes the athletes table look pretty. I think the key is the question "Is this a code that the user will type in and not just pick from a list?"
Let me give you an example of when I used a text column for a key. I was making software for processing medical claims. After the claim got all digitized a human had to look at the claim and then pick a code for it that designated what kind of claim it was. There were hundreds of codes...and these guys had them all memorized or crib sheets to help them. They'd been using these same codes for years. Using a 3 letter key let them just fly through the claims processing.
I recommend using ints or bigints for primary keys. Benefits include:
This allows for faster joins.
Having no semantic meaning in your primary key allows you to change the fields with semantic meaning without affecting relationships to other tables.
You can always have another column to hold a team_code or something for "OAK" and "SD" (sketched below).
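A sketch of that layout, using the question's names plus a hypothetical team_code column:

    CREATE TABLE team (
        teamid    INT PRIMARY KEY,            -- meaningless numeric key used for joins
        team_code CHAR(3) NOT NULL UNIQUE,    -- 'OAK', 'SD', ... for humans to read and type
        teamname  VARCHAR(50) NOT NULL
    );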
The standard answer is to use numbers because they are faster to index; no need to compute a hash or whatever.
If you use a meaningful value as a primary key, you'll have to update it all through your database if the team name changes.
To satisfy the above, but still make the database directly readable,
use a number field as the primary key
immediately create a view Athlete_And_Team that joins the Athlete and Team tables
Then you can use the view when you're going through the data by hand.
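A minimal sketch of that view, assuming the columns from the example:

    CREATE VIEW Athlete_And_Team AS
    SELECT a.id, a.name, a.teamid, t.teamname
    FROM athlete AS a
    JOIN team    AS t ON t.teamid = a.teamid;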
Are you talking about your primary key or your clustered index? Your clustered index should be the column you will most often use to uniquely identify a row. It also defines the logical ordering of the rows in your table. The clustered index will almost always be your primary key, but there are circumstances where they can be different.