Calculate MD5 for a long string - sql-server

When calling HASHBYTES with a long string, I get:
Msg 8152, Level 16, State 10, Line 11
String or binary data would be truncated.
I am trying to calculate the MD5 hash of multiple fields together so I can compare objects.
Is there any way around this?

Assuming you're using SQL Server 2008 or above, use the CHECKSUM function.
https://msdn.microsoft.com/en-us/library/ms189788.aspx
CHECKSUM computes a hash value, called the checksum, over its list of arguments. The hash value is intended for use in building hash indexes. If the arguments to CHECKSUM are columns, and an index is built over the computed CHECKSUM value, the result is a hash index. This can be used for equality searches over the columns.
CHECKSUM returns an error if any column is of noncomparable data type. Noncomparable data types are text, ntext, image, XML, and cursor, and also sql_variant with any one of the preceding types as its base type.

As @TimBiegeleisen said, SQL Server has an 8,000-byte limit on HASHBYTES input.
However, SQL Server 2016 and later do not appear to have this limitation. From the documentation:
For SQL Server 2014 (12.x) and earlier, allowed input values are limited to 8000 bytes.
https://learn.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-2017
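On versions with the 8,000-byte limit, one workaround sometimes used is to hash the input in chunks and then hash the concatenated chunk digests. The Python sketch below only illustrates that scheme outside the database; `chunked_md5` and the 8,000-byte chunk size are illustrative choices, not anything taken from the HASHBYTES documentation. The result is not the plain MD5 of the whole input, so both sides of a comparison would have to use the same scheme.

```python
import hashlib

CHUNK_SIZE = 8000  # assumed chunk size, chosen to match the pre-2016 HASHBYTES input limit


def chunked_md5(data: bytes, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Hash each chunk with MD5, then hash the concatenated chunk digests."""
    digests = b"".join(
        hashlib.md5(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    )
    return hashlib.md5(digests).digest()


if __name__ == "__main__":
    long_value = ("x" * 20000).encode("utf-16-le")  # 40,000 bytes, like nvarchar data
    print(chunked_md5(long_value).hex())
```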

Related

Hash function results on Redshift differ from SQL Server

I have a table hash_table both in Microsoft SQL Server and AWS Redshift.
Redshift
Column Name  Data Type
-----------  ------------------
phone        numeric(8,8)
name         string(2147483647)
SQL Server
Column Name  Data Type
-----------  ------------
phone        numeric(8,0)
name         nvarchar(80)
I want to extract a hash value from both tables so I can automate the value comparison. But even when I have the same values on both sides, the hash value from each field isn't the same.
I suppose it has something to do with the data types, but I haven't found anything regarding this matter in articles on hashing.
Am I doing something wrong?
Here are the functions I've used and their results. At first I tried with the name column but, since its data type differs between the databases, I decided to use phone:
Redshift
SELECT TOP (1)
len(TelephonyExtension) as PhoneLen,
TelephonyExtension as Phone,
MD5(CONVERT(nvarchar(30), phone)) as Hash
FROM hash_table
Result:
PhoneLen  Phone  Hash
--------  -----  --------------------------------
1         1      cfcd208495d565ef62e7dff9f98764fa
SQL Server
SELECT TOP (1)
len(TelephonyExtension) as PhoneLen,
TelephonyExtension as Phone,
HASHBYTES('MD5', CONVERT(nvarchar(30), phone)) as Hash
FROM hash_table
Result:
PhoneLen  Phone  Hash
--------  -----  --------------------------------
1         1      A46C3B54F2C9871CD91DAF7A932499X0
I have also used sha2_256 instead of MD5, but the problem persists.
I expected the hash columns to have the same value in both systems for any type of column.
Hashes operate on strings. If the hashes differ, the strings are likely different. Add MD5(CONVERT(nvarchar(30), phone)) as Hash to your selects and post if there are differences.
I've done this a few times for clients, and getting the strings to match exactly between two DBs can be tricky. Any extra spaces, non-printing chars, or puff of wind can cause a mismatch.
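One likely source of the mismatch: a hash is computed over bytes, not characters. SQL Server stores nvarchar as UTF-16LE, whereas Redshift strings are UTF-8, so the same visible text yields different byte sequences. The output formats differ too: Redshift's MD5 returns a 32-character hex string, while HASHBYTES returns varbinary. The Python sketch below reproduces the encoding effect outside either database; the sample value "1" is only an illustration.

```python
import hashlib

value = "1"  # illustrative sample value

# Bytes as an nvarchar value would be hashed by SQL Server (UTF-16LE, 2 bytes per character here)
utf16_bytes = value.encode("utf-16-le")
# Bytes as a UTF-8 string would be hashed (1 byte per character here)
utf8_bytes = value.encode("utf-8")

md5_utf16 = hashlib.md5(utf16_bytes).hexdigest()
md5_utf8 = hashlib.md5(utf8_bytes).hexdigest()

print(utf16_bytes, md5_utf16)
print(utf8_bytes, md5_utf8)   # md5 of b"1" is c4ca4238a0b923820dcc509a6f75849b
print(md5_utf16 == md5_utf8)  # False: same text, different bytes, different hash
```

Assuming the text itself is identical and fits a single-byte code page, converting to varchar on the SQL Server side before hashing and comparing the hex strings case-insensitively is one way to line the two systems up.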

Get short hash value from the HASHBYTES('SHA1', text) independent on the SQL Server version?

To get a content-derived key for a longer text, I calculate HASHBYTES('SHA1', text). It returns a 20-byte varbinary. As I know the length of the result, I am storing it as binary(20).
To make it shorter (to be used as a key), I would like to follow the Git idea of a short hash, that is, taking the first (or last) characters of the hexadecimal representation. Instead of characters, I would like to get a binary(5) value from the binary(20).
On SQL Server 2016, the following simple approach:
DECLARE @hash binary(20) = HASHBYTES('SHA1', N'příšerně žluťoučký kůň úpěl ďábelské ódy')
DECLARE @short binary(5) = @hash
SELECT @hash, @short
returns the leading bytes (higher-order bytes):
(No column name) (No column name)
0xE02C3C55FBA0DF13ADA1B626B1E31746D57B4602 0xE02C3C55FB
However, the documentation (https://learn.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql?view=sql-server-ver15) warns that:
Conversions between any data type and the binary data types are not guaranteed to be the same between versions of SQL Server.
Well, this is not exactly a conversion. Still, does this uncertainty also apply to getting a shorter binary value from a longer one? What should I expect from future versions of SQL Server?
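For comparison, taking the leading bytes explicitly is well defined for any byte string. The Python sketch below applies that to the digest shown above; it illustrates the intended result only and says nothing about what future SQL Server versions will do with the implicit binary(20) to binary(5) assignment.

```python
# SHA-1 digest reported above by SQL Server 2016 (20 bytes)
full_hash = bytes.fromhex("E02C3C55FBA0DF13ADA1B626B1E31746D57B4602")

# Explicitly keep the first 5 bytes (the leading, higher-order bytes)
short_hash = full_hash[:5]

print(len(full_hash), full_hash.hex().upper())
print(len(short_hash), short_hash.hex().upper())  # 5 E02C3C55FB
```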

SHA256 base 64 hash generation in SQL Server

I need to generate a SHA256 base 64 hash from a table in SQL Server, but I can't find that algorithm in the list of HASHBYTES arguments.
Is there a way to generate it directly in SQL Server?
Duplicate disclaimer:
My question is not a duplicate of SHA256 in T-sql stored procedure, as I am looking for the SHA256 base 64 version of the algorithm, which is not listed on the page.
Numeric Example
I have this query result in SQL Server
Start date,End date,POD,Amount,Currency
2016-01-01,2016-12-31,1234567890,12000,EUR
This gives me the following string (using the concatenate function):
2016-01-012016-12-31123456789012000EUR
With this conversion tool I get the following hash:
GMRzFNmm90KLVtO1kwTf7EcSeImq+96QTHgnWFFmZ0U
which I need to send to a customer.
First, the generator link you provided outputs the base64 representation in a not exactly correct format. Namely, it omits the padding sequence. Though theoretically optional, padding is mandatory in MS SQL Server (tested on 2012 and 2016 versions).
With this in mind, the following code gives you what you need:
declare @s varchar(max), @hb varbinary(128), @h64 varchar(128);
select @s = '2016-01-012016-12-31123456789012000EUR';
-- SHA-256 digest of the varchar input
set @hb = hashbytes('sha2_256', @s);
-- Base64-encode the digest via XQuery
set @h64 = cast(N'' as xml).value('xs:base64Binary(sql:variable("@hb"))', 'varchar(128)');
select @hb as [BinaryHash], @h64 as [64Hash];
Apart from the aforementioned padding, there is another caveat for you to look for. Make sure that the input string is always of the same type, that is, either always varchar or always nvarchar. If some of your hashes will be calculated from ASCII strings and some from UTF-16, results will be completely different. Depending on which languages are used in your system, it might make sense to always convert the plain text to nvarchar before hashing.
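Both caveats can be checked outside the database. The Python sketch below is only a cross-check under the assumption that the varchar input contains plain ASCII; it shows the padded and unpadded Base64 forms of a SHA-256 digest and that hashing the UTF-16LE (nvarchar-style) bytes of the same text gives a different result.

```python
import base64
import hashlib

text = "2016-01-012016-12-31123456789012000EUR"

# Hash the varchar-style (single-byte) encoding of the string
digest = hashlib.sha256(text.encode("ascii")).digest()

padded = base64.b64encode(digest).decode("ascii")  # standard Base64, with "=" padding
unpadded = padded.rstrip("=")                      # what a tool that omits padding would show

print(padded)
print(unpadded)

# The same text hashed as UTF-16LE (nvarchar-style) bytes gives an unrelated digest
digest_utf16 = hashlib.sha256(text.encode("utf-16-le")).digest()
print(base64.b64encode(digest_utf16).decode("ascii"))
```

A SHA-256 digest is 32 bytes, so its padded Base64 form is always 44 characters ending in a single "=".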

What are the differences between CHECKSUM() and BINARY_CHECKSUM() and when/what are the appropriate usage scenarios?

Again MSDN does not really explain in plain English the exact difference, or the information for when to choose one over the other.
CHECKSUM
Returns the checksum value computed over a row of a table, or over a list of expressions. CHECKSUM is intended for use in building hash indexes.
BINARY_CHECKSUM
Returns the binary checksum value computed over a row of a table or over a list of expressions. BINARY_CHECKSUM can be used to detect changes to a row of a table.
It does hint that binary checksum should be used to detect row changes, but not why.
Check out the following blog post that highlights the differences.
http://decipherinfosys.wordpress.com/2007/05/18/checksum-functions-in-sql-server-2005/
Adding info from this link:
The key intent of the CHECKSUM functions is to build a hash index based on an expression or a column list. If say you use it to compute and store a column at the table level to denote the checksum over the columns that make a record unique in a table, then this can be helpful in determining whether a row has changed or not. This mechanism can then be used instead of joining with all the columns that make the record unique to see whether the record has been updated or not. SQL Server Books Online has a lot of examples on this piece of functionality.
A couple of things to watch out for when using these functions:
You need to make sure that the column(s) or expression order is the same between the two checksums that are being compared else the value would be different and will lead to issues.
We would not recommend using checksum(*) since the value that will get generated that way will be based on the column order of the table definition at run time which can easily change over a period of time. So, explicitly define the column listing.
Be careful when you include datetime columns, since their granularity is 1/300th of a second and even a small variation will result in a different checksum value. So, if you have to use a datetime column, make sure that you reduce it to the exact level of granularity that you want, e.g. date + hour/min.
There are three checksum functions available to you:
CHECKSUM: This was described above.
CHECKSUM_AGG: This returns the checksum of the values in a group and Null values are ignored in this case. This also works with the new analytic function’s OVER clause in SQL Server 2005.
BINARY_CHECKSUM: As the name states, this returns the binary checksum value computed over a row or a list of expressions. The difference between CHECKSUM and BINARY_CHECKSUM is in the value generated for the string data-types. An example of such a difference is the values generated for “DECIPHER” and “decipher” will be different in the case of a BINARY_CHECKSUM but will be the same for the CHECKSUM function (assuming that we have a case insensitive installation of the instance).
Another difference is in the comparison of expressions. BINARY_CHECKSUM() returns the same value if the elements of two expressions have the same type and byte representation. So, “2Volvo Director 20” and “3Volvo Director 30” will yield the same value, however the CHECKSUM() function evaluates the type as well as compares the two strings and if they are equal, then only the same value is returned.
Example:
STRING              BINARY_CHECKSUM_USAGE  CHECKSUM_USAGE
------------------  ---------------------  --------------
2Volvo Director 20  -1356512636            -341465450
3Volvo Director 30  -1356512636            -341453853
4Volvo Director 40  -1356512636            -341455363
HASHBYTES with MD5 is 5 times slower than CHECKSUM. I tested this on a table with over 1 million rows and ran each test 5 times to get an average.
Interestingly CHECKSUM takes exactly the same time as BINARY_CHECKSUM.
Here is my post with the full results published:
http://networkprogramming.wordpress.com/2011/01/14/binary_checksum-vs-hashbytes-in-sql/
I've found that checksum collisions (i.e. two different values returning the same checksum) are more common than most people seem to think. We have a table of currencies, using the ISO currency code as the PK. And in a table of less than 200 rows, there are three pairs of currency codes that return the same Binary_Checksum():
"ETB" and "EUR" (Ethiopian Birr and Euro) both return 16386.
"LTL" and "MDL" (Lithuanian Litas and Moldovan leu) both return 18700.
"TJS" and "UZS" (Somoni and Uzbekistan Som) both return 20723.
The same happens with ISO culture codes: "de" and "eu" (German and Basque) both return 1573.
Changing Binary_Checksum() to Checksum() fixes the problem in these cases...but in other cases it may not help. So my advice is to test thoroughly before relying too heavily on the uniqueness of these functions.
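A rough birthday-bound estimate shows how likely collisions would be even for an ideal 32-bit hash. CHECKSUM and BINARY_CHECKSUM return an int (32 bits), and the examples above suggest they are far from uniformly distributed, so real collision rates can be much higher than this sketch predicts. The function name and the uniform-hash assumption are illustrative.

```python
import math

HASH_BITS = 32  # CHECKSUM and BINARY_CHECKSUM return a 32-bit int


def collision_probability(n_rows: int, bits: int = HASH_BITS) -> float:
    """Approximate chance of at least one collision among n_rows values,
    assuming an ideal, uniformly distributed hash (birthday approximation)."""
    space = 2 ** bits
    return 1.0 - math.exp(-n_rows * (n_rows - 1) / (2 * space))


if __name__ == "__main__":
    for n in (200, 10_000, 77_163, 1_000_000):
        print(f"{n:>9} rows: {collision_probability(n):.6f}")
```

For 200 rows an ideal 32-bit hash would collide with a probability of roughly 5 in a million, so three colliding pairs in a small currency table point at the weakness of the checksum algorithm itself rather than at bad luck.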
Be careful when using CHECKSUM; you may get an unexpected outcome. The following statements produce the same checksum value:
SELECT CHECKSUM (N'这么便宜怎么办?廉价iPhone售价再曝光', 5, 4102)
SELECT CHECKSUM (N'PlayStation Now – Sony startet Spiele-Streaming im Sommer 2014', 238, 13096)
It's easy to get collisions from CHECKSUM(). HASHBYTES() was added in SQL Server 2005 to enhance SQL Server's system hash functionality, so I suggest you also look into it as an alternative.
You can get the checksum value with this query:
SELECT
checksum(definition) as Checksum_Value,
definition
FROM sys.sql_modules
WHERE object_id = object_id('RefCRMCustomer_GetCustomerAdditionModificationDetail');
Replace the procedure name inside object_id() with your own.

VarBinary vs Image SQL Server Data Type to Store Binary Data?

I need to store binary files to the SQL Server Database. Which is the better Data Type out of Varbinary and Image?
Since image is deprecated, you should use varbinary.
Per Microsoft (thanks for the link, @Christopher):
ntext, text, and image data types will be removed in a future
version of Microsoft SQL Server. Avoid using these data types in new
development work, and plan to modify applications that currently use
them. Use nvarchar(max), varchar(max), and varbinary(max) instead.
Fixed and variable-length data types for storing large non-Unicode and
Unicode character and binary data. Unicode data uses the UNICODE UCS-2
character set.
varbinary(max) is the way to go (introduced in SQL Server 2005)
There is also the rather spiffy FileStream, introduced in SQL Server 2008.
https://learn.microsoft.com/en-us/sql/t-sql/data-types/ntext-text-and-image-transact-sql
image
Variable-length binary data from 0 through 2^31-1 (2,147,483,647) bytes.
The image data type is still supported, but be aware of:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql
varbinary [ ( n | max) ]
Variable-length binary data. n can be a value from 1 through 8,000. max indicates that the maximum storage
size is 2^31-1 bytes. The storage size is the actual length of the
data entered + 2 bytes. The data that is entered can be 0 bytes in
length. The ANSI SQL synonym for varbinary is binary varying.
So both have the same maximum size (2 GB). But be aware of:
https://learn.microsoft.com/en-us/sql/database-engine/deprecated-database-engine-features-in-sql-server-2016#features-not-supported-in-a-future-version-of-sql-server
Although the removal date of the image data type has not yet been determined, you should use the future-proof equivalent.
But you have to ask yourself: why store BLOBs in a column?
https://learn.microsoft.com/en-us/sql/relational-databases/blob/compare-options-for-storing-blobs-sql-server
