How to ensure specific character encoding in Microsoft SQL Server?

What I need is to ensure that a string gets encoded in a known character encoding. So far, my research and testing with MS SQL Server has revealed that the documented encoding is 'UCS-2', whereas the actual encoding (on the server in question) is 'UCS-2LE'.
That doesn't seem very reliable. What I would love is an ENCODE function as found in Perl, Node, or almost any other language, so that regardless of upgrades or settings changes, my hash function will always work on known input.
We can limit the string being hashed to hex digits, so at worst we could manually map the 16 possible input characters to the proper bytes. Does anyone have a recommendation on this?
Here's the Perl I'm using:
use Digest::SHA qw/sha256/;
use Encode qw/encode/;
$seed = 'DDFF5D36-F14D-495D-BAA6-3688786D6CFA';
$string = '123456789';
$target = '57392CD6A5192B6185C5999EB23D240BB7CEFD26E377D904F6FEF262ED176F97';
$encoded = encode('UCS-2LE', $seed.$string);
$sha256 = uc(unpack("H*", sha256($encoded)));
print "$target\n$sha256\n";
Which matches MS SQL:
HASHBYTES('SHA2_256', 'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789')
But what I really want is:
HASHBYTES('SHA2_256', ENCODE('UCS-2LE', 'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789'))
So that no matter what MS SQL happens to be encoding the input string as, the HASHBYTES will always operate on a known byte array.

SQL Server uses UCS-2 only for columns, variables, and literals that were declared as nvarchar. In all other cases it uses an 8-bit encoding, namely the code page of the current database's collation, unless specified otherwise (using the COLLATE clause, for example).
So, you either have to specify a Unicode literal:
select HASHBYTES('SHA2_256', N'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789');
Or, you can use a variable or table column of the nvarchar data type:
-- Variable
declare @var nvarchar(128) = N'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789';
select HASHBYTES('SHA2_256', @var);
-- Table column
declare @t table(
Value nvarchar(128)
);
insert into @t
select @var;
select HASHBYTES('SHA2_256', t.Value)
from @t t;
P.S. Of course, since Wintel is a little-endian platform, SQL Server uses the same byte order as the OS and hardware. Unless something new comes out in SQL Server 2017, there is no native way to get a big-endian representation.
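To check from outside the server which bytes actually get hashed, the difference between the two kinds of literal can be reproduced with explicit encodings. The Python sketch below is not part of the original answer. It assumes that an nvarchar value is hashed as UTF-16 little-endian bytes (identical to UCS-2LE for these characters) and that a varchar value is hashed as one byte per character for this ASCII-only input. The helper name hash_hex is made up for illustration.

```python
import hashlib

seed = "DDFF5D36-F14D-495D-BAA6-3688786D6CFA"
string = "123456789"


def hash_hex(text: str, encoding: str) -> str:
    """Return the uppercase hex SHA-256 of text encoded with the given codec."""
    return hashlib.sha256(text.encode(encoding)).hexdigest().upper()


# Assumed to correspond to hashing an nvarchar value (N'...' literal):
# two bytes per character, low byte first.
as_nvarchar = hash_hex(seed + string, "utf-16-le")

# Assumed to correspond to hashing a varchar value ('...' literal):
# one byte per character for this ASCII-only input.
as_varchar = hash_hex(seed + string, "ascii")

print(as_nvarchar)
print(as_varchar)
```

The two digests differ because the byte sequences differ, so fixing the data type on the SQL side (nvarchar or varchar) is what fixes the input to the hash.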

Related

MS SQL Server EncryptByKey - String or binary data would be truncated

In theory, varchar(max) and varbinary(max) columns should be capable of storing up to 2 GB of data, but I cannot store a Unicode string 5000 characters long.
I've looked through other questions on this topic and they all suggest checking column sizes. I've done this and see that all related columns are declared with max size.
The key difference from similar questions is that I'm encrypting the data with EncryptByKey before storing it, and I think that's the bottleneck I'm looking for. From MSDN I know that the return type of EncryptByKey has a maximum size of 8000 bytes. It is not clear what the maximum size of the @cleartext argument is, but I suspect it's the same.
The following code gives me an error:
OPEN SYMMETRIC KEY SK1 DECRYPTION BY CERTIFICATE Cert1;
DECLARE @tmp5k AS NVARCHAR(max);
SET @tmp5k = N'...5000 characters...';
SELECT EncryptByKey(Key_GUID('SK1'), @tmp5k);
GO
[22001][8152] String or binary data would be truncated.
How can I encrypt and store large strings (around 5k Unicode characters)?
I ran into this issue when using C# and trying to encrypt and insert a long JSON string into SQL. What ended up working was converting the plain-text string to binary and then passing that to the same SQL EncryptByKey function instead.
If you're doing this in just SQL, I think you can use this conversion:
CONVERT(VARBINARY(MAX), @tmp5k) AS ToBinary
So using our example:
OPEN SYMMETRIC KEY SK1 DECRYPTION BY CERTIFICATE Cert1;
DECLARE @tmp5k AS NVARCHAR(max);
SET @tmp5k = N'...5000 characters...';
SELECT EncryptByKey(Key_GUID('SK1'), CONVERT(VARBINARY(MAX), @tmp5k));
GO
And here's an example of using SQL to convert the binary back to a string:
CONVERT(VARCHAR(100), CONVERT(VARBINARY(100), @TestString)) AS StringFromBinaryFromString;
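A note on sizes and on converting back, sketched in Python. It assumes that an nvarchar value converted to varbinary is simply its UTF-16 little-endian byte sequence, two bytes per ordinary character. The 8000-byte figure is the EncryptByKey limit quoted in the question; the sample strings are illustrative.

```python
text = "x" * 5000  # stand-in for the 5000-character string in the question

# Assumed equivalent of CONVERT(VARBINARY(MAX), <nvarchar value>).
raw = text.encode("utf-16-le")
print(len(raw))  # 10000 bytes, above the 8000-byte figure from the question

# Decoding with the same encoding restores the original text.
assert raw.decode("utf-16-le") == text

# Decoding UTF-16 bytes as a single-byte encoding does not:
# every second character comes back as NUL.
wrong = "abc".encode("utf-16-le").decode("latin-1")
print(repr(wrong))  # 'a\x00b\x00c\x00'
```

This suggests that when the source value was nvarchar, the bytes should be converted back to nvarchar and not to varchar.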

Central european characters in SQL

I have an issue. I have data stored in SQL Server with Central European characters like "č", "ř", "ž", etc. The database uses the "Czech_CI_AS" collation, which should accept these characters. But when I try to select, for example, the name of a street containing these characters like this:
SELECT *
FROM Street where Name = 'Čáslavská'
it returns nothing.
When I remove the "Č", it returns what I need:
SELECT *
FROM Street where Name like '%áslavská'
The column is of type nvarchar. But I cannot use the N prefix before my string, because external applications read from this table and their selects are generated automatically.
Is there any solution? Or have I got something wrong?
Thanks for any help.
@YuriyTsarkov really deserves the credit here. To elaborate on his answer:
From MSDN:
Prefix Unicode character string constants with the letter N. Without the N prefix, the string is converted to the default code page of the database. This default code page may not recognize certain characters.
Example
-- Storing Čáslavská in two vars, with and without N prefix.
DECLARE @Test_001 NVARCHAR(255) = 'Čáslavská' COLLATE Czech_CI_AS;
DECLARE @Test_002 NVARCHAR(255) = N'Čáslavská' COLLATE Czech_CI_AS;
-- Test output.
SELECT
@Test_001 AS T1,
@Test_002 AS T2
;
Returns
T1 T2
Cáslavská Čáslavská
You need to either update all your external applications' code to use selects with the N prefix, or change the collation of your column to the one used by the external applications. The latter may cause some data loss.
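The effect of the missing N prefix can be imitated outside SQL Server. This Python sketch assumes the non-Unicode literal passes through a single-byte code page: Windows-1250 (Central European) contains Č, whereas Windows-1252 (Western European) does not. Python marks the unrepresentable character with "?", while the output above shows SQL Server substituting a similar-looking C, so only the loss itself is comparable, not the exact replacement character.

```python
name = "Čáslavská"

# Windows-1250 covers Czech, so the round trip is lossless.
print(name.encode("cp1250").decode("cp1250"))

# Windows-1252 has no Č; with errors="replace" it is substituted by "?".
lossy = name.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)
```

Once the character has been replaced, a comparison against the stored nvarchar value can no longer match, which is consistent with the empty result in the question.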

How to Use UTF-8 Collation in SQL Server database?

I've migrated a database from MySQL to SQL Server (politics); the original MySQL database used UTF-8.
Now I read at https://dba.stackexchange.com/questions/7346/sql-server-2005-2008-utf-8-collation-charset that SQL Server 2008 doesn't support UTF-8. Is this a joke?
The SQL Server hosts multiple databases, mostly Latin-encoded. Since the migrated db is intended for web publishing, I want to keep the UTF-8 encoding. Have I missed something, or do I need to encode/decode at the application level?
UTF-8 is not a character set, it's an encoding. The character set for UTF-8 is Unicode. If you want to store Unicode text you use the nvarchar data type.
Even if the database used UTF-8 to store text, you would still not get the text out as UTF-8 encoded data; you would get it out as decoded text.
You can easily store UTF-8 encoded text in the database, but then you don't store it as text, you store it as binary data (varbinary).
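The distinction between character set and encoding can be made concrete in a few lines of Python (used here only for illustration): one Unicode code point, two different byte sequences depending on the encoding, and the same text again after decoding.

```python
ch = "\u00e9"  # the character é

print(hex(ord(ch)))               # the Unicode code point: 0xe9
utf8 = ch.encode("utf-8")         # the bytes a UTF-8 store would hold
utf16 = ch.encode("utf-16-le")    # the bytes a UTF-16 little-endian store would hold
print(utf8, utf16)                # b'\xc3\xa9' b'\xe9\x00'

# Decoding either byte sequence gives back the same text.
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == ch
```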
Looks like this will finally be supported in SQL Server 2019!
SQL Server 2019 - what's new?
From BOL:
UTF-8 support
Full support for the widely used UTF-8 character encoding as an import
or export encoding, or as database-level or column-level collation for
text data. UTF-8 is allowed in the CHAR and VARCHAR datatypes, and is
enabled when creating or changing an object’s collation to a collation
with the UTF8 suffix.
For example, LATIN1_GENERAL_100_CI_AS_SC to
Latin1_General_100_CI_AS_KS_SC_UTF8. UTF-8 is only available to Windows
collations that support supplementary characters, as introduced in SQL
Server 2012. NCHAR and NVARCHAR allow UTF-16 encoding only, and remain
unchanged.
This feature may provide significant storage savings, depending on the
character set in use. For example, changing an existing column data
type with ASCII strings from NCHAR(10) to CHAR(10) using an UTF-8
enabled collation, translates into nearly 50% reduction in storage
requirements. This reduction is because NCHAR(10) requires 22 bytes
for storage, whereas CHAR(10) requires 12 bytes for the same Unicode
string.
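How much is saved depends on the characters stored. As a rough check, this Python sketch compares the encoded sizes of three sample strings in UTF-8 and UTF-16. It counts encoded characters only and does not model any per-value storage overhead, so the numbers are not directly the on-disk sizes quoted above.

```python
samples = ["ABCDEFGHIJ", "Čáslavská", "微波室外单元"]

# Map each sample to (bytes as UTF-8, bytes as UTF-16).
sizes = {s: (len(s.encode("utf-8")), len(s.encode("utf-16-le"))) for s in samples}

for s, (utf8_len, utf16_len) in sizes.items():
    print(f"{s}: {utf8_len} bytes as UTF-8, {utf16_len} bytes as UTF-16")
```

ASCII text halves in size, accented Latin text shrinks less, and Chinese text takes more space in UTF-8 than in UTF-16.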
2019-05-14 update:
Documentation seems to be updated now and explains our options starting in MSSQL 2019 in section "Collation and Unicode Support".
2019-07-24 update:
Article by Pedro Lopes (Senior Program Manager @ Microsoft) about introducing UTF-8 support for Azure SQL Database
No! It's not a joke.
Take a look here: http://msdn.microsoft.com/en-us/library/ms186939.aspx
Character data types that are either fixed-length, nchar, or
variable-length, nvarchar, Unicode data and use the UNICODE UCS-2
character set.
And also here: http://en.wikipedia.org/wiki/UTF-16
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996.
Two UDF to deal with UTF-8 in T-SQL:
CREATE Function UcsToUtf8(@src nvarchar(MAX)) returns varchar(MAX) as
begin
declare @res varchar(MAX)='', @pi char(8)='%[^'+char(0)+'-'+char(127)+']%', @i int, @j int
select @i=patindex(@pi,@src collate Latin1_General_BIN)
while @i>0
begin
select @j=unicode(substring(@src,@i,1))
if @j<0x800 select @res=@res+left(@src,@i-1)+char((@j&1984)/64+192)+char((@j&63)+128)
else select @res=@res+left(@src,@i-1)+char((@j&61440)/4096+224)+char((@j&4032)/64+128)+char((@j&63)+128)
select @src=substring(@src,@i+1,datalength(@src)-1), @i=patindex(@pi,@src collate Latin1_General_BIN)
end
select @res=@res+@src
return @res
end
GO
CREATE Function Utf8ToUcs(@src varchar(MAX)) returns nvarchar(MAX) as
begin
declare @i int, @res nvarchar(MAX)=@src, @pi varchar(18)
select @pi='%[à-ï][€-¿][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
while @i>0 select @res=stuff(@res,@i,3,nchar(((ascii(substring(@src,@i,1))&31)*4096)+((ascii(substring(@src,@i+1,1))&63)*64)+(ascii(substring(@src,@i+2,1))&63))), @src=stuff(@src,@i,3,'.'), @i=patindex(@pi,@src collate Latin1_General_BIN)
select @pi='%[Â-ß][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
while @i>0 select @res=stuff(@res,@i,2,nchar(((ascii(substring(@src,@i,1))&31)*64)+(ascii(substring(@src,@i+1,1))&63))), @src=stuff(@src,@i,2,'.'),@i=patindex(@pi,@src collate Latin1_General_BIN)
return @res
end
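The bit arithmetic in UcsToUtf8 is the standard UTF-8 layout for code points below U+10000. As a cross-check, here is the same arithmetic as a small Python sketch, with the decimal masks written in hex (1984 = 0x7C0, 61440 = 0xF000, 4032 = 0xFC0, 63 = 0x3F). The function name ucs_to_utf8 is made up for illustration, and like the T-SQL version it only handles one-, two-, and three-byte sequences.

```python
def ucs_to_utf8(text: str) -> bytes:
    """Encode BMP text to UTF-8 using the same masks as the T-SQL function."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII passes through unchanged.
            out.append(cp)
        elif cp < 0x800:
            # Two bytes: 110xxxxx 10xxxxxx
            out.append(((cp & 0x7C0) >> 6) + 0xC0)
            out.append((cp & 0x3F) + 0x80)
        elif cp < 0x10000:
            # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(((cp & 0xF000) >> 12) + 0xE0)
            out.append(((cp & 0xFC0) >> 6) + 0x80)
            out.append((cp & 0x3F) + 0x80)
        else:
            raise ValueError("code points above U+FFFF are not handled")
    return bytes(out)


for sample in ["Apple", "Čáslavská", "微波室外单元"]:
    # The hand-rolled result should agree with the built-in codec.
    print(sample, ucs_to_utf8(sample) == sample.encode("utf-8"))
```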
Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT, and OPENROWSET.
Addendum 2016-12-21: SQL Server 2016 SP1 now enables Unicode Compression (and most other previously Enterprise-only features) for all editions of SQL Server, including Standard and Express. This is not the same as UTF-8 support, but it yields a similar benefit if the goal is disk space reduction for Western alphabets.

How to Show Eastern Letter(Chinese Character) on SQL Server/SQL Reporting Services?

I need to insert Chinese characters into my database, but it always shows ????
Example:
Insert this record:
微波室外单元-Apple
Then it becomes ???
Result:
??????-Apple
I really need help. Thanks in advance.
I am using MSSQL Server 2008.
Make sure you specify a Unicode string with a capital N when you insert, like:
INSERT INTO Table1 (Col1) SELECT N'微波室外单元-Apple' AS [Col1]
and that Table1 (Col1) is an NVARCHAR data type.
Make sure the column you're inserting to is nchar, nvarchar, or ntext. If you insert a Unicode string into an ANSI column, you really will get question marks in the data.
Also, be careful to check that when you pull the data back out you're not just seeing a client display problem but are actually getting the question marks back:
SELECT Unicode(YourColumn), YourColumn FROM YourTable
Note that the Unicode function returns the code of only the first character in the string.
Once you've determined whether the column is really storing the data correctly, post back and we'll help you more.
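Why the question marks are real data and not just a display problem can be shown with a short Python sketch. It assumes the non-Unicode column uses a single-byte code page such as Windows-1252, which is an assumption about the asker's server and not something stated in the question.

```python
text = "微波室外单元-Apple"

# A single-byte code page has no Chinese characters, so each one is
# replaced by "?" at the moment of conversion.
stored = text.encode("cp1252", errors="replace")
print(stored)  # b'??????-Apple'

# The original characters are gone: the first stored byte is a real "?".
print(stored[0], ord("?"))  # 63 63

# Encoded as UTF-16 (assumed to correspond to nvarchar), nothing is lost.
assert text.encode("utf-16-le").decode("utf-16-le") == text
```

So if the Unicode(YourColumn) check above returns 63, the stored data really is a question mark and has to be inserted again with the N prefix into an nvarchar column.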
Try adding the appropriate languages to your Windows locale settings. You'll have to make sure your development machine is set to display non-Unicode characters in the appropriate language.
And of course you need to use nvarchar for foreign-language fields.
Make sure that you have set the database encoding to one that supports these characters. UTF-8 is the de facto encoding, as it's ASCII-compatible but supports all 1,114,112 Unicode code points.
SELECT 'UPDATE table SET msg=UNISTR('''||ASCIISTR(msg)||''') WHERE id='''||id||'''' FROM table WHERE id = '123344556';

SQL Server Text Datatype Maxlength = 65,535?

Software I'm working with uses a text field to store XML. From my searches online, the text datatype is supposed to hold 2^31 - 1 characters. Currently SQL Server is truncating the XML at 65,535 characters every time. I know this is caused by SQL Server, because if I add a 65,536th character to the column directly in Management Studio, it states that it will not update because characters will be truncated.
Is the max length really 65,535 or could this be because the database was designed in an earlier version of SQL Server (2000) and it's using the legacy text datatype instead of 2005's?
If this is the case, will altering the datatype to Text in SQL Server 2005 fix this issue?
That is a limitation of SSMS, not of the text field, but you should use varchar(max), since text is deprecated.
Here is also a quick test:
create table TestLen (bla text)
insert TestLen values (replicate(convert(varchar(max),'a'), 100000))
select datalength(bla)
from TestLen
Returns 100000 for me.
MSSQL 2000 should allow up to 2^31 - 1 characters (non-Unicode) in a text field, which is over 2 billion. I don't know what's causing this limitation, but you might want to try using varchar(max) or nvarchar(max). These store as many characters but also allow the regular string T-SQL functions (like LEN, SUBSTRING, REPLACE, RTRIM, ...).
If you're able to convert the column, you might as well, since the text data type will be removed in a future version of SQL Server. See here.
The recommendation is to use varchar(MAX) or nvarchar(MAX). In your case, you could also use the XML data type, but that may tie you to certain database engines (if that's a consideration).
You should have a look at:
XML Support in Microsoft SQL Server 2005
Beginning SQL Server 2005 XML Programming
So I would rather use the data type appropriate for the job than make a data type from a previous version fit your use.
Here's a little script I wrote for reading out all the data in chunks:
DECLARE @data NVARCHAR (MAX) = N'huge data';
DECLARE @readSentence NVARCHAR (MAX) = N'';
DECLARE @currIndex INT = 1; -- SUBSTRING positions are 1-based
DECLARE @temp NVARCHAR (MAX);
-- Read 65535 characters at a time until everything has been read back.
WHILE @data <> @readSentence
BEGIN
SET @temp = SUBSTRING (@data, @currIndex, 65535);
SELECT @temp;
SET @readSentence += @temp;
SET @currIndex += 65535;
END;
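The same chunked read can be done on the client side. As a minimal sketch in Python, with a made-up helper name chunks and a 100,000-character stand-in for the stored value:

```python
CHUNK = 65535  # the display limit observed in the question


def chunks(data: str, size: int = CHUNK):
    """Yield consecutive slices of data, each at most size characters long."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


data = "a" * 100_000  # stand-in for the large value
parts = list(chunks(data))
print([len(p) for p in parts])  # [65535, 34465]

# Joining the pieces gives back the complete value.
assert "".join(parts) == data
```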
