I am facing a problem when applying a SHA-256 hash to German umlaut characters.
--Without Umlaut
SELECT CONVERT(VARCHAR(MAX), HASHBYTES('SHA2_256','o'), 2) as HASH_ID
SQL Server output: 65C74C15A686187BB6BBF9958F494FC6B80068034A659A9AD44991B08C58F2D2
This matches the output from
https://www.pelock.com/products/hash-calculator
--With Umlaut
SELECT CONVERT(VARCHAR(MAX), HASHBYTES('SHA2_256','ö'), 2)
SQL Server output: B0B2988B6BBE724BACDA5E9E524736DE0BC7DAE41C46B4213C50E1D35D4E5F13
Output from pelock: 6DBD11FD012E225B28A5D94A9B432BC491344F3E92158661BE2AE5AE2B8B1AD8
I want the SQL Server output to match pelock. I have tested outputs from other sources (Snowflake and Python) and all of them align with the output from pelock. I am not sure why SQL Server is not giving the right result. Any help is much appreciated.
You have two issues:
The literal text itself is being reinterpreted, because you have the wrong database collation. You can use the N prefix to prevent that, but this leads to a second problem...
The value from pelock is a hash of the UTF-8 bytes, but using N means the input will be UTF-16 nvarchar.
So you need to use the N prefix, a UTF-8 binary collation, and a cast back to varchar:
SELECT CONVERT(VARCHAR(MAX), HASHBYTES('SHA2_256',CAST(N'ö' COLLATE Latin1_General_100_BIN2_UTF8 AS varchar(100))), 2)
Result
6DBD11FD012E225B28A5D94A9B432BC491344F3E92158661BE2AE5AE2B8B1AD8
UTF-8 collations are only supported in SQL Server 2019 and later. On older versions you would need to find a different collation that handles the characters you have, and it may not be possible to find one that handles all of your data.
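If you are stuck on an older version, a sketch of a workaround is to hash the UTF-8 bytes directly as varbinary, provided you can produce those bytes yourself (here the literal 0xC3B6, the UTF-8 encoding of 'ö', is hard-coded purely for illustration):
SELECT CONVERT(VARCHAR(MAX), HASHBYTES('SHA2_256', 0xC3B6), 2) AS HASH_ID
-- 6DBD11FD012E225B28A5D94A9B432BC491344F3E92158661BE2AE5AE2B8B1AD8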
I'm using Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 (X64)
Aug 22 2017 17:04:49
Copyright (C) 2017 Microsoft Corporation
Developer Edition (64-bit) on Windows Server 2012 R2 Standard Evaluation 6.3 <X64> (Build 9600: ) (Hypervisor)
When I do INSERT INTO MYTABLE(COLNAME) VALUES ('Қазақстан'), it inserts question marks (?????????) instead of the actual string. My column type is NVARCHAR(255). If I put the N prefix before my value, like ...(N'Қазақстан'), it is persisted properly. Is this the only way to insert non-ASCII characters, or should I change something else?
Thanks!
The answer to your question is yes: you should use the N prefix.
From the docs:
When prefixing a string constant with the letter N, the implicit conversion will result in a UCS-2 or UTF-16 string if the constant to convert does not exceed the max length for the nvarchar string data type (4,000). Otherwise, the implicit conversion will result in a large-value nvarchar(max).
So if you don't use N, SQL Server will treat the literal as VARCHAR (non-Unicode). Using N makes it a Unicode string.
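A quick way to see the difference (a minimal sketch using a temp table; names are illustrative):
CREATE TABLE #Demo (ColName NVARCHAR(255));
INSERT INTO #Demo (ColName) VALUES ('Қазақстан');  -- varchar literal: stored as ?????????
INSERT INTO #Demo (ColName) VALUES (N'Қазақстан'); -- nvarchar literal: stored correctly
SELECT ColName FROM #Demo;
DROP TABLE #Demo;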
There are already several questions and solutions about accent-insensitive search on Stack Overflow, but none of them work for code page 1250 (Central and Eastern European languages):
How do I perform an accent insensitive compare (e with è, é, ê and ë) in SQL Server?
LINQ Where Ignore Accentuation and Case
Ignoring accents in SQL Server using LINQ to SQL
Modify search to make it Accent Insensitive in SQL Server
Questions about accent insensitivity in SQL Server (Latin1_General_CI_AS)
The problem is that accent-insensitive collations are bound to specific code pages, and I cannot find an accent-insensitive collation for code page 1250 in the MSDN documentation.
I need to modify the collation of the column to make Entity Framework work in an accent-insensitive way.
For example, if I change the collation to SQL_LATIN1_GENERAL_CP1_CI_AI, c with acute (U+0107) is selected as c without acute, because of the wrong code page.
How to solve this?
SELECT *
FROM sys.fn_helpcollations()
WHERE COLLATIONPROPERTY(name, 'CodePage') = 1250
AND description LIKE '%accent-insensitive%';
This returns 264 collations to choose from.
Picking the first one
SELECT N'è' COLLATE Albanian_CI_AI
UNION
SELECT N'é'
UNION
SELECT N'ê'
UNION
SELECT N'ë'
returns a single row, as desired (showing that all four compared equal).
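Once you have chosen a collation, you can apply it to the column itself so that Entity Framework comparisons pick it up (a sketch; table and column names are illustrative, and any indexes or constraints on the column may need to be dropped first):
ALTER TABLE dbo.MyTable
    ALTER COLUMN MyColumn NVARCHAR(255) COLLATE Albanian_CI_AI;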
OK, it seems that the linked MSDN documentation is for SQL Server 2008. I use SQL Server 2014, but I was not able to find any collation documentation for 2014.
The solution, though, is to list the collations from the server for my code page:
SELECT name, COLLATIONPROPERTY(name, 'CodePage') AS CodePage
FROM fn_helpcollations()
where COLLATIONPROPERTY(name, 'CodePage') = 1250
ORDER BY name;
And I can see there is an undocumented collation, Czech_100_CI_AI, which works for me. Eureka!
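A quick sanity check along the lines of the other answer (a single row is expected if the collation really compares these characters as equal):
SELECT N'è' COLLATE Czech_100_CI_AI
UNION
SELECT N'é'
UNION
SELECT N'ê'
UNION
SELECT N'ë';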
I am running a SELECT statement which contains the following CASE expression:
SELECT
(CASE MyTable.IsBook WHEN 1 THEN 'B' ELSE 'M' END) AS IsBookOrManuscript
FROM MyTable;
I have the exact same database (schema and data) restored on two different physical servers, running SQL Server 2008 R2 (build 10.50.4276.0) and SQL Server 2014 respectively.
When run on SQL Server 2014, the query returns as expected. When run on SQL Server 2008 R2, the error message Incorrect syntax near ' '. occurs.
Searching the script file for non-ASCII characters, I found that three occurrences of the 0x0A character do indeed appear in the CASE line, and removing them solves the problem on SQL Server 2008 R2.
Does anyone know why that happens? Is it intended behavior? As far as I can see, two CUs have been released since my last update, but they do not seem to fix or acknowledge the problem. Any thoughts?
Check the setting of QUOTED_IDENTIFIER on each server.
As MSDN says:
Causes SQL Server to follow the ISO rules regarding quotation mark delimiting identifiers and literal strings. Identifiers delimited by double quotation marks can be either Transact-SQL reserved keywords or can contain characters not generally allowed by the Transact-SQL syntax rules for identifiers.
So try the following:
SET QUOTED_IDENTIFIER OFF
-- Type your code here
SET QUOTED_IDENTIFIER ON
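You can inspect the current session setting first via @@OPTIONS, where bit 256 corresponds to QUOTED_IDENTIFIER:
SELECT CASE WHEN (256 & @@OPTIONS) = 256 THEN 'ON' ELSE 'OFF' END AS QUOTED_IDENTIFIER;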
I've migrated a database from MySQL to SQL Server (politics); the original MySQL database uses UTF-8.
Now I read at https://dba.stackexchange.com/questions/7346/sql-server-2005-2008-utf-8-collation-charset that SQL Server 2008 doesn't support UTF-8. Is this a joke?
The SQL Server hosts multiple databases, mostly Latin-encoded. Since the migrated database is intended for web publishing, I want to keep the UTF-8 encoding. Have I missed something, or do I need to encode/decode at the application level?
UTF-8 is not a character set; it's an encoding. The character set for UTF-8 is Unicode. If you want to store Unicode text, you use the nvarchar data type.
Even if the database used UTF-8 to store text, you would still not get the text out as UTF-8 encoded data; you would get it out as decoded text.
You can easily store UTF-8 encoded text in the database, but then you don't store it as text, you store it as binary data (varbinary).
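In practice that means one of these two shapes (a minimal sketch; table and column names are illustrative):
-- Unicode text, stored internally as UTF-16; the application encodes it
-- to UTF-8 at the web boundary:
CREATE TABLE dbo.Pages (Body NVARCHAR(MAX));
INSERT INTO dbo.Pages (Body) VALUES (N'Қазақстан');
-- Raw UTF-8 bytes, if you must keep the exact encoding inside the database;
-- the column is then opaque binary as far as SQL Server is concerned:
CREATE TABLE dbo.PagesRaw (BodyUtf8 VARBINARY(MAX));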
Looks like this will finally be supported in SQL Server 2019!
SQL Server 2019 - what's new?
From BOL:
UTF-8 support
Full support for the widely used UTF-8 character encoding as an import or export encoding, or as database-level or column-level collation for text data. UTF-8 is allowed in the CHAR and VARCHAR data types, and is enabled when creating or changing an object's collation to a collation with the UTF8 suffix. For example, LATIN1_GENERAL_100_CI_AS_SC to Latin1_General_100_CI_AS_KS_SC_UTF8. UTF-8 is only available to Windows collations that support supplementary characters, as introduced in SQL Server 2012. NCHAR and NVARCHAR allow UTF-16 encoding only, and remain unchanged.
This feature may provide significant storage savings, depending on the character set in use. For example, changing an existing column data type with ASCII strings from NCHAR(10) to CHAR(10) using a UTF-8 enabled collation translates into nearly a 50% reduction in storage requirements. This reduction is because NCHAR(10) requires 22 bytes for storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
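For example, on SQL Server 2019 a varchar column can store UTF-8 directly (a minimal sketch; the table name is illustrative, and Latin1_General_100_CI_AS_SC_UTF8 is one of the _UTF8 collations):
CREATE TABLE dbo.Utf8Demo (Txt VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8);
INSERT INTO dbo.Utf8Demo (Txt) VALUES (N'Қазақстан');
SELECT Txt, DATALENGTH(Txt) AS Bytes FROM dbo.Utf8Demo; -- 18 bytes: nine 2-byte UTF-8 characters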
2019-05-14 update:
The documentation seems to be updated now and explains our options starting with SQL Server 2019, in the section "Collation and Unicode Support".
2019-07-24 update:
Article by Pedro Lopes, Senior Program Manager at Microsoft, about introducing UTF-8 support for Azure SQL Database.
No! It's not a joke.
Take a look here: http://msdn.microsoft.com/en-us/library/ms186939.aspx
Character data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.
And also here: http://en.wikipedia.org/wiki/UTF-16
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996.
Two UDFs to deal with UTF-8 in T-SQL:
CREATE FUNCTION UcsToUtf8 (@src nvarchar(MAX)) RETURNS varchar(MAX) AS
BEGIN
    DECLARE @res varchar(MAX) = '',
            @pi char(8) = '%[^' + char(0) + '-' + char(127) + ']%', -- pattern: first non-ASCII character
            @i int, @j int
    SELECT @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    WHILE @i > 0
    BEGIN
        SELECT @j = unicode(substring(@src, @i, 1))
        IF @j < 0x800 -- two-byte UTF-8 sequence: 110xxxxx 10xxxxxx
            SELECT @res = @res + left(@src, @i - 1) + char((@j & 1984) / 64 + 192) + char((@j & 63) + 128)
        ELSE          -- three-byte UTF-8 sequence: 1110xxxx 10xxxxxx 10xxxxxx
            SELECT @res = @res + left(@src, @i - 1) + char((@j & 61440) / 4096 + 224) + char((@j & 4032) / 64 + 128) + char((@j & 63) + 128)
        SELECT @src = substring(@src, @i + 1, datalength(@src) - 1),
               @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    END
    SELECT @res = @res + @src
    RETURN @res
END
CREATE FUNCTION Utf8ToUcs (@src varchar(MAX)) RETURNS nvarchar(MAX) AS
BEGIN
    DECLARE @i int, @res nvarchar(MAX) = @src, @pi varchar(18)
    -- three-byte sequences first: lead byte 0xE0-0xEF ([à-ï] under code page 1252)
    -- followed by two continuation bytes 0x80-0xBF ([€-¿])
    SELECT @pi = '%[à-ï][€-¿][€-¿]%', @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    WHILE @i > 0
        SELECT @res = stuff(@res, @i, 3, nchar(((ascii(substring(@src, @i, 1)) & 31) * 4096) + ((ascii(substring(@src, @i + 1, 1)) & 63) * 64) + (ascii(substring(@src, @i + 2, 1)) & 63))),
               @src = stuff(@src, @i, 3, '.'),
               @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    -- then two-byte sequences: lead byte 0xC2-0xDF ([Â-ß]) plus one continuation byte
    SELECT @pi = '%[Â-ß][€-¿]%', @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    WHILE @i > 0
        SELECT @res = stuff(@res, @i, 2, nchar(((ascii(substring(@src, @i, 1)) & 31) * 64) + (ascii(substring(@src, @i + 1, 1)) & 63))),
               @src = stuff(@src, @i, 2, '.'),
               @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    RETURN @res
END
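A quick round-trip check, assuming the functions were created in the dbo schema:
DECLARE @s nvarchar(10) = N'ö';
SELECT dbo.UcsToUtf8(@s) AS Utf8,                 -- two bytes, displayed as 'Ã¶' under code page 1252
       dbo.Utf8ToUcs(dbo.UcsToUtf8(@s)) AS Back;  -- N'ö' again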
Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT, and OPENROWSET.
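For example, a UTF-8 encoded file can be loaded like this on SQL Server 2016 or later (a sketch; the file path and table name are illustrative):
BULK INSERT dbo.MyTable
FROM 'C:\data\utf8_file.csv'
WITH (CODEPAGE = '65001', FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');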
Addendum 2016-12-21: SQL Server 2016 SP1 now enables Unicode compression (and most other previously Enterprise-only features) for all editions of SQL Server, including Standard and Express. This is not the same as UTF-8 support, but it yields a similar benefit if the goal is disk-space reduction for Western alphabets.