Hash function results on Redshift differ from SQL Server - sql-server

I have a table hash_table both in Microsoft SQL Server and AWS Redshift.
Redshift
Column Name  Data Type
phone        numeric(8,8)
name         string(2147483647)
SQL Server
Column Name  Data Type
phone        numeric(8,0)
name         nvarchar(80)
I want to extract a hash value from both tables so I can automate the value comparison. But even when I have the same values on both sides, the hash value from each field isn't the same.
I suppose it has something to do with the data types, but I haven't found anything regarding this matter in articles about hashing.
Am I doing something wrong?
Here are the functions I've used and their results. At first I tried the name column but, since its data type differs between the two databases, I decided to use phone:
Redshift
SELECT TOP (1)
len(TelephonyExtension) as PhoneLen,
TelephonyExtension as Phone,
MD5(CONVERT(nvarchar(30), phone)) as Hash
FROM hash_table
Result:
PhoneLen  Phone  Hash
1         1      cfcd208495d565ef62e7dff9f98764fa
SQL Server
SELECT TOP (1)
len(TelephonyExtension) as PhoneLen,
TelephonyExtension as Phone,
HASHBYTES('MD5', CONVERT(nvarchar(30), phone)) as Hash
FROM hash_table
Result:
PhoneLen  Phone  Hash
1         1      A46C3B54F2C9871CD91DAF7A932499X0
I have also used sha2_256 instead of MD5, but the problem persists.
I expected the hash columns to have the same value in both systems for any type of column.

Hash operates on strings. If the hash is different then the strings are likely different. Add MD5(CONVERT(nvarchar(30), phone)) as Hash to your selects and post if there are differences.
I've done this a few times for clients, and getting the strings to match exactly between two DBs can be tricky. Any extra spaces, non-printing chars, or puff of wind can cause a mismatch.

Related

How to query Azure SQL server with specific type of hyphen?

I have some garbage in my DB. I want to know how much of it I have.
The correct data should be like this: foo-bar. Unfortunately I also have foo－bar.
If I run both queries I get the same data.
select * from data
where field like '%-%';
select * from data
where field like '%－%';
result for both queries
foo-bar
foo－bar
How can I query only for the data that contains one of these characters and not the other?
Both azure web client and azure data studio seem to somehow convert this char as if there is only one.
These are two different Unicode characters: https://www.codetable.net/decimal/45 versus https://www.codetable.net/decimal/65293
You have 2 issues. One is that Unicode string literals need an N prefix to denote Unicode. The other is that a binary collation is needed so that the comparison is done on code points rather than on characters:
SELECT * FROM data
WHERE field LIKE N'%-%' COLLATE Latin1_General_BIN;
SELECT * FROM data
WHERE field LIKE N'%－%' COLLATE Latin1_General_BIN;
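The difference between comparing code points and comparing "characters" can be sketched outside the database. The Python below is only an analogy, not what SQL Server does internally: it uses NFKC normalization as a stand-in for a collation that treats the two hyphens as equal, and the sample rows are invented.

```python
import unicodedata

ascii_hyphen = "\u002d"      # HYPHEN-MINUS, decimal 45
fullwidth_hyphen = "\uff0d"  # FULLWIDTH HYPHEN-MINUS, decimal 65293

# Hypothetical rows standing in for the `field` column.
rows = ["foo" + ascii_hyphen + "bar", "foo" + fullwidth_hyphen + "bar"]

# Code-point comparison (what a binary collation gives you):
# only the row that really contains the fullwidth character matches.
only_fullwidth = [r for r in rows if fullwidth_hyphen in r]

# Compatibility normalization folds the fullwidth form into the ASCII one,
# which is why a non-binary comparison can see the two rows as identical.
folded = [unicodedata.normalize("NFKC", r) for r in rows]

print(len(only_fullwidth))     # number of rows matched by code point
print(folded[0] == folded[1])  # whether the folded rows compare equal
```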

Issue with datatype Money in SQL SERVER vs string

I have a spreadsheet whose values all get loaded into SQL Server. One of the fields in the spreadsheet happens to be money. In order for everything to be displayed correctly, I added a field to my table with Money as the data type.
When I read the value from the spreadsheet I store it as a String, such as "94259.4". When it gets inserted into SQL Server it looks like "94259.4000". Is there a way to get rid of the trailing 0's in the SQL Server value when I read it from the DB? The issue I'm running into is that, even though these two values are the same, they are compared as Strings, so the comparison reports that they are not the same value.
I foresee another issue when the value looks like 94,259.40. I think what might work is limiting the numbers to 2 digits after the period. So as long as I select the value from the server in the format 94,259.40, I think I should be okay.
EDIT:
For Column = 1 To 34
Select Case Column
Case 1 'Field 1
If Not ([String].IsNullOrEmpty(CStr(excel.Cells(Row, Column).Value)) Or CStr(excel.Cells(Row, Column).Value) = "") Then
strField1 = CStr(excel.Cells(Row, Column).Value)
End If
Case 2 'Field 2
' and so on
I go through each field and store the value as a string. Then I compare it against the DB and see if there is a record that has the same values. The only field in my way is the Money field.
You can use Format() to compare strings, or even cast to float. For example:
Declare @YourTable table (value money)
Insert Into @YourTable values
(94259.4000),
(94259.4500),
(94259.0000)
Select Original = value
,AsFloat = cast(value as float)
,Formatted = format(value,'0.####')
From @YourTable
Returns
Original  AsFloat   Formatted
94259.40  94259.4   94259.4
94259.45  94259.45  94259.45
94259.00  94259     94259
I should note that Format() has some great functionality, but it is NOT known for its performance.
The core issue is that string data is being used to represent numeric information, hence the problems comparing "123.400" to "123.4" and getting mismatches. They should mismatch. They're strings.
The solution is to store the data in the spreadsheet in its proper form - numeric, and then select a proper format for the database - which is NOT the "Money" datatype (insert shudders and visions of vultures circling overhead). Otherwise, you are going to have an expanding kluge of conversions between types as you go back and forth between two improperly designed solutions, and finding more and more edge cases that "don't quite work," and require more special cases...and so on.
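The "compare numbers, not strings" point can be sketched on the application side. The question's code is VB.NET; the Python below is only an illustration of the idea. It assumes a comma is a thousands separator and a period is the decimal point, and to_decimal is a hypothetical helper name.

```python
from decimal import Decimal


def to_decimal(text: str) -> Decimal:
    """Parse a money-like string, dropping thousands separators.

    Assumption: ',' groups thousands and '.' is the decimal point.
    """
    return Decimal(text.replace(",", "").strip())


spreadsheet_value = "94259.4"
database_value = "94259.4000"
formatted_value = "94,259.40"

# As strings the values differ; as decimals they are equal.
print(spreadsheet_value == database_value)
print(to_decimal(spreadsheet_value) == to_decimal(database_value))
print(to_decimal(formatted_value) == to_decimal(spreadsheet_value))
```

Decimal is used here instead of float so that amounts such as 94259.45 are compared exactly rather than through a binary approximation.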

SQL Server - Text column contains number between

Can someone tell me how I code in SQL Server so that I am looking in a varchar text column to see if it contains a numerical range within the text?
For example, I'm looking for columns that contain anything between 100000 and 999999. The column may have a value like
this field contains a number `567391`
so I want to select that one, but not if it had
this field contains a number `5391`
For your given example, you can check the digits:
where col like '%[^0-9][1-9][0-9][0-9][0-9][0-9][0-9][^0-9]%'
This is not a generic solution, but it works for your example. In general, parsing strings in SQL Server is difficult. It is better to extract the values you are interested in when loading the data, so the relevant values are correctly in their own columns.
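If the values are extracted at load time as suggested, a regular expression is one way to do it. This is a minimal Python sketch, not T-SQL, and the function name is made up. Unlike the LIKE pattern above, which needs a non-digit character on both sides of the number, the lookarounds here also accept a number at the very start or end of the text.

```python
import re

# Exactly six digits, first digit 1-9 (so 100000..999999),
# not preceded or followed by another digit.
SIX_DIGITS = re.compile(r"(?<!\d)[1-9]\d{5}(?!\d)")


def has_six_digit_number(text: str) -> bool:
    """Return True if `text` contains a standalone number in 100000..999999."""
    return SIX_DIGITS.search(text) is not None


print(has_six_digit_number("this field contains a number `567391`"))
print(has_six_digit_number("this field contains a number `5391`"))
```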

What are the differences between CHECKSUM() and BINARY_CHECKSUM() and when/what are the appropriate usage scenarios?

Again MSDN does not really explain in plain English the exact difference, or the information for when to choose one over the other.
CHECKSUM
Returns the checksum value computed over a row of a table, or over a list of expressions. CHECKSUM is intended for use in building hash indexes.
BINARY_CHECKSUM
Returns the binary checksum value computed over a row of a table or over a list of expressions. BINARY_CHECKSUM can be used to detect changes to a row of a table.
It does hint that binary checksum should be used to detect row changes, but not why.
Check out the following blog post that highlights the differences.
http://decipherinfosys.wordpress.com/2007/05/18/checksum-functions-in-sql-server-2005/
Adding info from this link:
The key intent of the CHECKSUM functions is to build a hash index based on an expression or a column list. If say you use it to compute and store a column at the table level to denote the checksum over the columns that make a record unique in a table, then this can be helpful in determining whether a row has changed or not. This mechanism can then be used instead of joining with all the columns that make the record unique to see whether the record has been updated or not. SQL Server Books Online has a lot of examples on this piece of functionality.
A couple of things to watch out for when using these functions:
• You need to make sure that the column or expression order is the same between the two checksums being compared; otherwise the values will differ and lead to issues.
• We would not recommend using checksum(*), since the value generated that way is based on the column order of the table definition at run time, which can easily change over time. So, explicitly define the column list.
• Be careful when you include datetime columns, since the granularity is 1/300th of a second and even a small variation will result in a different checksum value. So, if you have to use a datetime column, make sure that you reduce it to the level of granularity that you want, e.g. the exact date + hour/min.
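The same precautions apply to any row fingerprint, not only CHECKSUM. The Python sketch below swaps in MD5 purely for illustration; the column names, the "|" separator and the helper name are all hypothetical. It fixes the column order explicitly and truncates datetimes to the minute before hashing.

```python
import hashlib
from datetime import datetime

# Explicit, fixed column order (hypothetical column names).
COLUMNS = ("id", "name", "updated_at")


def row_fingerprint(row: dict) -> str:
    """Hash the listed columns in a fixed order, datetimes cut to the minute."""
    parts = []
    for col in COLUMNS:
        value = row[col]
        if isinstance(value, datetime):
            # Drop seconds and microseconds so tiny variations do not matter.
            value = value.replace(second=0, microsecond=0).isoformat()
        parts.append(str(value))
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


a = {"id": 1, "name": "Volvo", "updated_at": datetime(2007, 5, 18, 10, 30, 12)}
# Same data, different key order, and a timestamp a few seconds later.
b = {"updated_at": datetime(2007, 5, 18, 10, 30, 47), "name": "Volvo", "id": 1}

print(row_fingerprint(a) == row_fingerprint(b))
```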
There are three checksum functions available to you:
• CHECKSUM: This was described above.
• CHECKSUM_AGG: This returns the checksum of the values in a group and Null values are ignored in this case. This also works with the new analytic function’s OVER clause in SQL Server 2005.
• BINARY_CHECKSUM: As the name states, this returns the binary checksum value computed over a row or a list of expressions. The difference between CHECKSUM and BINARY_CHECKSUM is in the value generated for the string data-types. An example of such a difference is the values generated for “DECIPHER” and “decipher” will be different in the case of a BINARY_CHECKSUM but will be the same for the CHECKSUM function (assuming that we have a case insensitive installation of the instance).
Another difference is in the comparison of expressions. BINARY_CHECKSUM() returns the same value if the elements of two expressions have the same type and byte representation. So, “2Volvo Director 20” and “3Volvo Director 30” will yield the same value, however the CHECKSUM() function evaluates the type as well as compares the two strings and if they are equal, then only the same value is returned.
Example:
STRING              BINARY_CHECKSUM_USAGE  CHECKSUM_USAGE
------------------- ---------------------- --------------
2Volvo Director 20  -1356512636            -341465450
3Volvo Director 30  -1356512636            -341453853
4Volvo Director 40  -1356512636            -341455363
HASHBYTES with MD5 is 5 times slower than CHECKSUM. I tested this on a table with over 1 million rows and ran each test 5 times to get an average.
Interestingly CHECKSUM takes exactly the same time as BINARY_CHECKSUM.
Here is my post with the full results published:
http://networkprogramming.wordpress.com/2011/01/14/binary_checksum-vs-hashbytes-in-sql/
I've found that checksum collisions (i.e. two different values returning the same checksum) are more common than most people seem to think. We have a table of currencies, using the ISO currency code as the PK. And in a table of less than 200 rows, there are three pairs of currency codes that return the same Binary_Checksum():
"ETB" and "EUR" (Ethiopian Birr and Euro) both return 16386.
"LTL" and "MDL" (Lithuanian Litas and Moldovan leu) both return 18700.
"TJS" and "UZS" (Somoni and Uzbekistan Som) both return 20723.
The same happens with ISO culture codes: "de" and "eu" (German and Basque) both return 1573.
Changing Binary_Checksum() to Checksum() fixes the problem in these cases...but in other cases it may not help. So my advice is to test thoroughly before relying too heavily on the uniqueness of these functions.
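These collisions are easy to reproduce with a very simple model. The Python below is a sketch of a rotate-left-by-4-then-XOR scheme that is often described as matching BINARY_CHECKSUM for short single-byte strings. That model is an assumption, not documented behavior, and the function name is made up. It does, however, reproduce every value quoted above.

```python
def rotl_xor_checksum(data: bytes) -> int:
    """Toy 32-bit checksum: rotate left 4 bits, then XOR in the next byte.

    Assumption: an informal model of BINARY_CHECKSUM on short varchar
    values, not Microsoft's documented algorithm.
    """
    acc = 0
    for byte in data:
        acc = ((acc << 4) | (acc >> 28)) & 0xFFFFFFFF  # 32-bit rotate left
        acc ^= byte
    return acc


pairs = [(b"ETB", b"EUR"), (b"LTL", b"MDL"), (b"TJS", b"UZS"), (b"de", b"eu")]
for left, right in pairs:
    print(left, right, rotl_xor_checksum(left), rotl_xor_checksum(right))
```

With only four bits of shift per character, a change in one byte can be cancelled by a change in the next, which is why such short codes collide so readily under this model.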
Be careful when using CHECKSUM; you may get an unexpected outcome. The following statements produce the same checksum value:
SELECT CHECKSUM (N'这么便宜怎么办?廉价iPhone售价再曝光', 5, 4102)
SELECT CHECKSUM (N'PlayStation Now – Sony startet Spiele-Streaming im Sommer 2014', 238, 13096)
It's easy to get collisions from CHECKSUM(). HASHBYTES() was added in SQL 2005 to enhance SQL Server's system hash functionality, so I suggest you also look into this as an alternative.
You can get the checksum value of a module's definition with this query:
SELECT
checksum(definition) as Checksum_Value,
definition
FROM sys.sql_modules
WHERE object_id = object_id('RefCRMCustomer_GetCustomerAdditionModificationDetail');
Replace the procedure name in the parentheses with your own.

Ordering numbers that are stored as strings in the database

I have a bunch of records in several tables in a database that have a "process number" field, that's basically a number, but I have to store it as a string both because of some legacy data that has stuff like "89a" as a number and some numbering system that requires that process numbers be represented as number/year.
The problem arises when I try to order the processes by number. I get stuff like:
1
10
11
12
And the other problem is when I need to add a new process. The new process' number should be the biggest existing number incremented by one, and for that I would need a way to order the existing records by number.
Any suggestions?
Maybe this will help.
Essentially:
SELECT process_order FROM your_table ORDER BY process_order + 0 ASC
Can you store the numbers as zero padded values? That is, 01, 10, 11, 12?
I would suggest to create a new numeric field used only for ordering and update it from a trigger.
Can you split the data into two fields?
Store the 'process number' as an int and the 'process subtype' as a string.
That way:
• you can easily get the MAX processNumber - and increment it when you need to generate a new number
• you can ORDER BY processNumber ASC, processSubtype ASC - to get the correct order, even if multiple records have the same base number with different years/letters appended
• when you need the 'full' number you can just concatenate the two fields
Would that do what you need?
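A minimal sketch of the two-field idea, in Python for illustration only. It assumes every process number starts with digits followed by an optional suffix such as "a" or "/2009"; the function name and sample values are invented.

```python
import re
from typing import Tuple


def split_process_number(raw: str) -> Tuple[int, str]:
    """Split '89a' into (89, 'a') and '7/2009' into (7, '/2009')."""
    match = re.match(r"(\d+)(.*)", raw.strip())
    if match is None:
        raise ValueError(f"no leading number in {raw!r}")
    return int(match.group(1)), match.group(2)


# Hypothetical stored values.
numbers = ["10", "1", "89a", "12", "11", "89", "7/2009"]

# Order by the numeric part first, then by the suffix.
ordered = sorted(numbers, key=split_process_number)

# The next process number is the largest numeric part plus one.
next_number = max(split_process_number(n)[0] for n in numbers) + 1

print(ordered)
print(next_number)
```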
Given that your process numbers don't seem to follow any fixed patterns (from your question and comments), can you construct/maintain a process number table that has two fields:
create table process_ordering ( processNumber varchar(N), processOrder int )
Then select all the process numbers from your tables and insert them into the process number table. Set the ordering however you want based on the (varying) process number formats. Join on this table, order by processOrder and select all fields from the other table. Index this table on processNumber to make the join fast.
select my_processes.*
from my_processes
inner join process_ordering on my_processes.processNumber = process_ordering.processNumber
order by process_ordering.processOrder
It seems to me that you have two tasks here.
• Convert the strings to numbers by legacy format/strip off the junk
• Order the numbers
If you have a practical way of introducing string-parsing regular expressions into your process (and your issue has enough volume to be worth the effort), then I'd
• Create a reference table such as
CREATE TABLE tblLegacyFormatRegularExpressionMaster(
LegacyFormatId int,
LegacyFormatName varchar(50),
RegularExpression varchar(max)
)
• Then, with a way of invoking the regular expressions, such as the CLR integration in SQL Server 2005 and above (the .NET Common Language Runtime integration that allows calls to compiled .NET methods from within SQL Server as ordinary (Microsoft extended) T-SQL), you should be able to solve your problem.
• See
http://www.codeproject.com/KB/string/SqlRegEx.aspx
I apologize if this is way too much overhead for your problem at hand.
Suggestion:
• Make your column a fixed width text (i.e. CHAR rather than VARCHAR).
• Pad the existing values with enough leading zeros to fill each column and a trailing space(s) where the values do not end in 'a' (or whatever).
• Add a CHECK constraint (or equivalent) to ensure new values conform to the pattern e.g. something like
CHECK (process_number LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][ab ]')
• In your insert/update stored procedures (or equivalent), pad any incoming values to fit the pattern.
• Remove the leading/trailing zeros/spaces as appropriate when displaying the values to humans.
Another advantage of this approach is that the incoming values '1', '01', '001', etc would all be considered to be the same value and could be covered by a simple unique constraint in the DBMS.
BTW I like the idea of splitting the trailing 'a' (or whatever) into a separate column, however I got the impression the data element in question is an identifier and therefore would not be appropriate to split it.
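The padding step in this suggestion can be sketched as follows. This is Python for illustration only; it assumes the six-digits-plus-one-character layout from the CHECK pattern above, and the function name is hypothetical.

```python
def pad_process_number(raw: str, width: int = 6) -> str:
    """Pad '1' to '000001 ' and '89a' to '000089a' (digits, then a/b/space)."""
    raw = raw.strip()
    # Trailing letter allowed by the CHECK pattern, else a single space.
    suffix = raw[-1] if raw and raw[-1] in "ab" else " "
    digits = raw[:-1] if suffix != " " else raw
    if not digits.isdigit() or len(digits) > width:
        raise ValueError(f"cannot pad {raw!r}")
    return digits.zfill(width) + suffix


# Hypothetical incoming values.
values = ["10", "1", "89a", "89", "12"]
padded = sorted(pad_process_number(v) for v in values)

print(padded)
# '1', '01' and '001' all map to the same stored value.
print(pad_process_number("1") == pad_process_number("001"))
```

Because every stored value has the same width, plain string ordering matches numeric ordering, and a space sorts before a letter, so '89' comes before '89a'.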
You need to cast your field as you're selecting. I'm basing this syntax on MySQL - but the idea's the same:
select * from table order by cast(field AS UNSIGNED);
Of course UNSIGNED could be SIGNED if required.
