I need to create a hash key on my tables for uniqueness, and someone mentioned MD5 to me. But I have read about CHECKSUM and BINARY_CHECKSUM; would these not serve the same purpose of ensuring no duplicates in a specific field?
Anyway, I managed to implement this and I can see the hash keys in my tables.
Do I need to alter the index keys I originally created, since I have now created a new index on these hash keys? Do I also need to change the existing keys?
How do I change my queries, for example my SELECT statements?
I guess I am still unsure how hash keys really help in queries, other than for uniqueness.
If your goal is to ensure no duplicates in a specific field, why not just apply a unique index to that field and let the database engine do what it was meant to do?
It makes no sense to write a unique function to replace SQL Server unique constraints/indexes.
How are you going to ensure the hash is unique? With a constraint?
If you index it (which may not be allowed because of determinism), the optimiser will still treat it as non-unique, and it will kill performance as well.
And you only have a few 100,000 rows. Peanuts.
Given time I could come up with more arguments, but I'll summarise: Don't do it
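For completeness, here is roughly what that looks like; the table and column names are made up for illustration:

```sql
-- Hypothetical table: let the engine enforce uniqueness directly on the column.
CREATE TABLE dbo.Customers
(
    CustomerId INT IDENTITY(1,1) PRIMARY KEY,
    Email      NVARCHAR(320) NOT NULL,
    CONSTRAINT UQ_Customers_Email UNIQUE (Email)
);

-- Or, equivalently, as a separate unique index on an existing table:
-- CREATE UNIQUE INDEX UX_Customers_Email ON dbo.Customers (Email);
```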
There's always the HashBytes() function. It supports md5, but if you don't like it there's an option for sha1.
As for how this can help queries: one simple example is if you have a large varchar column — maybe varchar max — and in your query you want to know if the contents of this column match a particular string. If you have to compare your search with every single record it could be slow. But if you hash your search string and use that, things can go much faster since now it's just a very short binary compare.
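A rough sketch of that pattern, using a table and column invented for the example: a persisted computed hash that you index, then compare hashes first and confirm the actual value afterwards.

```sql
-- Assumed table dbo.Documents with a large column Body VARCHAR(MAX).
-- (Older SQL Server versions only let HASHBYTES accept up to 8,000 bytes of input.)
ALTER TABLE dbo.Documents
    ADD BodyHash AS CAST(HASHBYTES('SHA1', Body) AS VARBINARY(20)) PERSISTED;

CREATE INDEX IX_Documents_BodyHash ON dbo.Documents (BodyHash);

DECLARE @search VARCHAR(MAX) = 'some long text to find';

-- The indexed 20-byte comparison narrows the rows; the second predicate
-- confirms the match in case of a hash collision.
SELECT *
FROM dbo.Documents
WHERE BodyHash = HASHBYTES('SHA1', @search)
  AND Body = @search;
```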
Cryptographically secure hash functions are one-way functions, and they consume more resources (CPU cycles) than functions that are not cryptographically secure. If you just need a hash key, you do not need that property. All you need is a low probability of collisions, which is related to uniformity. Try a CRC if you have strings, or modulo for numbers.
http://en.wikipedia.org/wiki/Hash_function
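In SQL Server terms, CHECKSUM/BINARY_CHECKSUM are the cheap, non-cryptographic options next to HASHBYTES; a quick comparison (the values are just whatever your server returns):

```sql
-- CHECKSUM is a cheap 32-bit hash that collides easily;
-- HASHBYTES is costlier but far less collision-prone.
SELECT CHECKSUM(N'hello world')          AS CheapHash,  -- int (4 bytes)
       HASHBYTES('MD5',  N'hello world') AS Md5Hash,    -- 16 bytes
       HASHBYTES('SHA1', N'hello world') AS Sha1Hash;   -- 20 bytes
```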
Why don't you use a GUID with a default of NEWSEQUENTIALID()? Don't use NEWID(), since it is horrible for clustering; see here: Best Practice: Do not cluster on UniqueIdentifier when you use NewId
Make this column the primary key and you are pretty much done.
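Something along these lines (the table and column names are only illustrative; note that NEWSEQUENTIALID() can only be used in a DEFAULT constraint):

```sql
CREATE TABLE dbo.Sessions
(
    SessionId UNIQUEIDENTIFIER NOT NULL
              CONSTRAINT DF_Sessions_SessionId DEFAULT NEWSEQUENTIALID()
              CONSTRAINT PK_Sessions PRIMARY KEY CLUSTERED,
    StartedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
```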
I have a table user with multiple columns; every user has a unique userid.
Because it is unique, I don't have to specify a clustering key unless I want to use the column in queries. Is this bad, because every partition consists of a single row? If it is bad for whatever reason, what is the best practice in this case?
Thank you for your help!
Edit: If I have a query that needs to return all usernames, how can I do that with good performance? Doing it from this table does not seem very efficient to me; should I make another table where I simply duplicate all usernames in a collection? Then they are all in one place and the read doesn't have to jump across multiple nodes.
I just answered the similar question. Short story - it really depends on the access patterns, and table settings. You may need to tune the table parameters to get best performance, but the settings may depend on the amount of data, and other requirements.
There are always two (main) considerations when defining your primary keys in Cassandra:
Data distribution
Query pattern match
From a data distribution standpoint, you can't get much better than using a unique key as the partition key. The more distinct values there are, the more evenly they should hash out and thus be distributed across the cluster.
However, a key which distributes well but doesn't fit the desired query pattern, is pretty useless.
tl;dr;
If that unique key is all you'll ever query the table by, then it makes a great choice for a partition key.
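As a sketch in CQL (the names are made up, and the right model ultimately depends on your queries): with the unique userid as the partition key, a lookup by userid touches exactly one partition, while "give me all usernames" has to scan the whole cluster and usually deserves its own table keyed for that query.

```sql
CREATE TABLE users (
    userid   uuid PRIMARY KEY,   -- partition key: distributes rows evenly
    username text,
    email    text
);

-- Efficient: hits a single partition.
SELECT username FROM users WHERE userid = 123e4567-e89b-12d3-a456-426614174000;
```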
I'm fairly new to SQL Server, so if anything I say doesn't make sense, there's a good chance I'm just confused by something. Anyway...
I have a simple mapping table. It has two columns, Before and After. All I want is a constraint that the Before column is unique. Originally it was set to be a primary key, but this created errors when the value was too large. I tried adding an ID column as a primary key and then adding UNIQUE to the Before column, but I have the same problem with the max length exceeding 900 bytes (I guess the constraint creates an index).
The only option I can think of is to change the ID column to a checksum column and make that the primary key, but I dislike this option. Is there a different way to do this? I just need two simple columns.
The only way I can think of to guarantee uniqueness inside the database is to use an INSTEAD OF trigger. The link I provided to MSDN has an example for checking uniqueness. This solution will most likely be quite slow indeed, since you won't be able to index on the column being checked.
You could speed it up somewhat by using a computed column to create a hash, perhaps using the HASHBYTES function, of the Before column. You could then create a non-unique index on that hash column, and inside your trigger check for the negative case -- that is, check to see if a row with the same hash doesn't exist. If that happens, exit the trigger. In the case there is another row with the same hash, you could then do the more expensive check for an exact duplicate, and raise an error if the user enters a duplicate value. You also might be able to simplify your check by simply comparing both the hash value and the Before value in one EXISTS() clause, but I haven't played around with the performance of that solution.
(Note that the HASHBYTES function I referred to can itself hash only up to 8,000 bytes. If you want to go bigger than that, you'll have to roll your own hash function or live with the collisions caused by the CHECKSUM() function.)
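To make the shape of that concrete, here is a rough sketch; the table, column and trigger names are invented, and it deliberately keeps only the simple single-row logic:

```sql
CREATE TABLE dbo.Mapping
(
    Id         INT IDENTITY(1,1) PRIMARY KEY,
    Before     NVARCHAR(MAX) NOT NULL,
    After      NVARCHAR(MAX) NOT NULL,
    -- Persisted computed hash: short enough to index, unlike Before itself.
    BeforeHash AS CAST(HASHBYTES('SHA1', Before) AS VARBINARY(20)) PERSISTED
);

CREATE INDEX IX_Mapping_BeforeHash ON dbo.Mapping (BeforeHash);
GO

CREATE TRIGGER trg_Mapping_UniqueBefore
ON dbo.Mapping
INSTEAD OF INSERT
AS
BEGIN
    -- Cheap indexed probe on the hash first, exact comparison only on a match.
    IF EXISTS (SELECT 1
               FROM dbo.Mapping m
               JOIN inserted i
                 ON m.BeforeHash = HASHBYTES('SHA1', i.Before)
                AND m.Before = i.Before)
    BEGIN
        RAISERROR('Duplicate Before value', 16, 1);
        RETURN;
    END

    INSERT INTO dbo.Mapping (Before, After)
    SELECT Before, After
    FROM inserted;
END;
```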
I have been using Git a lot recently and I quite like how Git avoids duplicating similar data by using a hashing function based on SHA-1. I was wondering if current databases do something similar, or is this inefficient for some reason?
There is no need for this. Databases already have a good way of avoiding duplicating data - database normalization.
For example imagine you have a column that can contain one of five different strings. Instead of storing one of these strings into each row you should move these string out into a separate table. Create a table with two columns, one with the strings values and the other as a primary key. You can now use a foreign key in your original table instead of storing the whole string.
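A minimal sketch of that, with made-up names:

```sql
-- Lookup table holds each distinct string exactly once.
CREATE TABLE dbo.TicketStatus
(
    StatusId   TINYINT     PRIMARY KEY,
    StatusName VARCHAR(50) NOT NULL UNIQUE
);

-- The original table stores only the small key, not the repeated string.
CREATE TABLE dbo.Tickets
(
    TicketId INT IDENTITY(1,1) PRIMARY KEY,
    StatusId TINYINT NOT NULL
             CONSTRAINT FK_Tickets_TicketStatus REFERENCES dbo.TicketStatus (StatusId)
);
```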
I came up with a nice "reuse-based-on-hash" technique (it's probably widely used though)
I computed the hash-code of all fields in the row, and then I used this hash-code as primary key.
When I inserted, I simply did INSERT IGNORE (to suppress errors about duplicate primary keys). Either way, I could be sure that what I wanted to insert was present in the database after the insertion.
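Roughly, in MySQL terms (the names are illustrative):

```sql
-- The primary key is a hash of the content, so identical content maps to one row.
CREATE TABLE snippets (
    content_hash BINARY(20) PRIMARY KEY,   -- SHA-1 of the content
    content      TEXT NOT NULL
);

-- INSERT IGNORE suppresses the duplicate-key error; re-running this is a no-op.
INSERT IGNORE INTO snippets (content_hash, content)
VALUES (UNHEX(SHA1('some content')), 'some content');
```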
If this is a known concept I'd be glad to hear about it!
For SQL Server, is it better to use a uniqueidentifier (GUID) or a bigint for an identity column?
That depends on what you're doing:
If speed is the primary concern then a plain old int is probably big enough.
If you really will have more than 2 billion (with a B ;) ) records, then use bigint or a sequential guid.
If you need to be able to easily synchronize with records created remotely, then Guid is really great.
Update
Some additional (less-obvious) notes on Guids:
They can be hard on indexes, and that cuts to the core of database performance
You can use sequential guids to get back some of the indexing performance, but you give up some of the randomness that helps with the security point below.
Guids can be hard to debug by hand (where id='xxx-xxx-xxxxx'), but you get some of that back via sequential guids as well (where id='xxx-xxx' + '123').
For the same reason, Guids can make ID-based security attacks more difficult, but not impossible. (You can't just type 'http://example.com?userid=xxxx' and expect to get a result for someone else's account.)
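The two shapes of the choice look roughly like this (illustrative table names; you'd pick one, not both):

```sql
-- Integer identity: small, fast, easy to read and to debug by hand.
CREATE TABLE dbo.Invoices_Int
(
    InvoiceId BIGINT IDENTITY(1,1) PRIMARY KEY,
    Amount    DECIMAL(18,2) NOT NULL
);

-- Sequential GUID: 16-byte keys that can also be generated off-server,
-- without the page-split pain of random NEWID() values.
CREATE TABLE dbo.Invoices_Guid
(
    InvoiceId UNIQUEIDENTIFIER NOT NULL
              DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    Amount    DECIMAL(18,2) NOT NULL
);
```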
In general I'd recommend a BIGINT over a GUID (as guids are big and slow), but the question is, do you even need that? (I.e. are you doing replication?)
If you're expecting less than 2 billion rows, the traditional INT will be fine.
If you are doing replication, or you have sales people who run disconnected databases that need to merge, use a GUID. Otherwise I'd go for an int or bigint. They are far easier to deal with in the long run.
Depends on what you need. DB performance would gain from an integer, while GUIDs are useful for replication and for not requiring a round trip to the DB to find out what identity value was created, i.e. code could create the GUID identity before inserting the row.
If you're planning on using merge replication then a ROWGUIDCOL is beneficial to performance (see here for info). Otherwise we need more info about what your definition of 'better' is; better for what?
Unless you have a real need for a GUID, such as being able to generate keys anywhere and not just on the server, then I would stick with using INTEGER-based keys. GUIDs are expensive to create and make it harder to actually look at the data. Plus, have you ever tried to type a GUID in an SQL query? It's painful!
There are a few more aspects or requirements that can favour a GUID.
If the primary key is of a numeric type (int, bigint or any other), then either you need to make it an identity column, or you need to check the last saved value in the table.
In that case, if the record in the foreign (child) table is saved inside the same transaction, it can be awkward to get the last identity value of the primary key; using something like IDENT_CURRENT again hurts performance while saving the record in the foreign key table.
So when saving records inside a transaction, it can be convenient to first generate the GUID for the primary key, and then save that generated GUID in both the primary and foreign table(s), as sketched below.
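A minimal sketch of that, with invented table names; the GUID is created up front and reused for the child row, so nothing has to be read back from the database:

```sql
DECLARE @OrderId UNIQUEIDENTIFIER = NEWID();  -- could equally be generated in application code

BEGIN TRANSACTION;

INSERT INTO dbo.Orders (OrderId, CustomerName)
VALUES (@OrderId, N'Alice');

INSERT INTO dbo.OrderLines (OrderLineId, OrderId, Product)
VALUES (NEWID(), @OrderId, N'Widget');

COMMIT TRANSACTION;
```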
It really depends on whether or not the incoming data is somehow sequential. For things such as users, a GUID might be better. But for sequential data, such as orders or other things that need to be easily sortable, a bigint may well be a better solution, as the key will be indexed and provide fast sorting without the cost of another index.
It really depends whether you're expecting to have replication in the picture. Replication requires a row UUID, so if you're planning on doing that you may as well do it up front.
I'm with Andrew Rollings.
Now you could argue space efficiency. An int is what, 4 bytes, 8 for a bigint? A GUID, at 16 bytes, is going to be noticeably longer.
But I have two main reasons for preference: readability and access time. Numbers are easier for me than GUIDs (since I can always find the next/previous record easily).
As for access time, note that some DBs can start to have BIG problems with GUIDs. I know this is the case with MySQL (MySQL InnoDB Primary Key Choice: GUID/UUID vs Integer Insert Performance). This may not be much of a problem with SQL Server, but it's something to watch out for.
I'd say stick with INT or BIGINT. The only time I would think you'd want the GUID is when you are going to give them out and don't want people to be able to guess the IDs of other records for security reasons.
Suppose a database table has a column "Name" which is defined as key for the table.
Usual name values will be "Bill", "Elizabeth", "Bob", "Alice". Lookups on the table will be done by the name key as well.
Does hashing the values optimize the operations in any way? i.e. entering each name as some hashed value of the name (suppose MD5, a 128-bit hash).
If so - shouldn't this be a feature of the database and not something the client handles?
Assuming your database generates an index for the primary key (and I can't imagine it wouldn't) it's doing it for you. So yes, it should absolutely be something that the database handles.
"Does hashing the values optimize the operations in any way? " Not really.
Hashes are one-way. You can't do a table scan and reconstruct the original name.
If you want to keep both name and hash-of-name, you've broken a fundamental rule by including derived data. Now a name update requires a hash update.
The "spread my values around evenly" that would happen with a hash, is the job of an index.
No, don't hash them. Your database will build an index based on the data, and hashing won't help. The only time it might help is if your key values were much longer than the hash.