How to use UTF-8 collation in a SQL Server database?

I've migrated a database from MySQL to SQL Server (politics); the original MySQL database used UTF-8.
Now I've read https://dba.stackexchange.com/questions/7346/sql-server-2005-2008-utf-8-collation-charset that SQL Server 2008 doesn't support UTF-8. Is this a joke?
The SQL Server instance hosts multiple databases, mostly Latin-encoded. Since the migrated database is intended for web publishing, I want to keep the UTF-8 encoding. Have I missed something, or do I need to encode/decode at the application level?

UTF-8 is not a character set, it's an encoding. The character set for UTF-8 is Unicode. If you want to store Unicode text, you use the nvarchar data type.
Even if the database used UTF-8 to store text, you still would not get the text out as UTF-8-encoded bytes; you would get it out as decoded text.
You can easily store UTF-8-encoded text in the database, but then you don't store it as text, you store it as binary data (varbinary).
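To make the distinction concrete, here is a minimal sketch (the table variables are made up for illustration) contrasting the two options:

-- Option 1: store decoded Unicode text; SQL Server keeps it internally as UTF-16
DECLARE @AsText table (Txt nvarchar(100));
INSERT INTO @AsText VALUES (N'Grüße');

-- Option 2: store raw UTF-8 bytes; the application must encode and decode
DECLARE @AsBytes table (Utf8 varbinary(200));
INSERT INTO @AsBytes VALUES (0xC3B6); -- the two UTF-8 bytes for 'ö'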

Looks like this will finally be supported in SQL Server 2019!
SQL Server 2019 - what's new?
From BOL:
UTF-8 support
Full support for the widely used UTF-8 character encoding as an import
or export encoding, or as database-level or column-level collation for
text data. UTF-8 is allowed in the CHAR and VARCHAR datatypes, and is
enabled when creating or changing an object’s collation to a collation
with the UTF8 suffix.
For example, LATIN1_GENERAL_100_CI_AS_SC to
Latin1_General_100_CI_AS_KS_SC_UTF8. UTF-8 is only available to Windows
collations that support supplementary characters, as introduced in SQL
Server 2012. NCHAR and NVARCHAR allow UTF-16 encoding only, and remain
unchanged.
This feature may provide significant storage savings, depending on the
character set in use. For example, changing an existing column data
type with ASCII strings from NCHAR(10) to CHAR(10) using a UTF-8
enabled collation, translates into nearly 50% reduction in storage
requirements. This reduction is because NCHAR(10) requires 22 bytes
for storage, whereas CHAR(10) requires 12 bytes for the same Unicode
string.
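As a quick sketch of what that looks like on SQL Server 2019+ (the table name is made up; the collation is one of the documented _UTF8 collations):

CREATE TABLE dbo.Utf8Demo
(
    Id int IDENTITY(1,1) PRIMARY KEY,
    Txt varchar(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT INTO dbo.Utf8Demo (Txt) VALUES (N'héllo wörld');
SELECT Txt, DATALENGTH(Txt) AS Bytes FROM dbo.Utf8Demo; -- ASCII chars take 1 byte, accented chars 2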
2019-05-14 update:
The documentation has been updated and now explains our options starting in MSSQL 2019, in the section "Collation and Unicode Support".
2019-07-24 update:
Article by Pedro Lopes, Senior Program Manager at Microsoft, about introducing UTF-8 support for Azure SQL Database

No! It's not a joke.
Take a look here: http://msdn.microsoft.com/en-us/library/ms186939.aspx
Character data types that are either fixed-length, nchar, or
variable-length, nvarchar, Unicode data and use the UNICODE UCS-2
character set.
And also here: http://en.wikipedia.org/wiki/UTF-16
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996.
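The practical difference shows up with supplementary characters, which UCS-2 cannot represent but UTF-16 stores as surrogate pairs. A small sketch:

SELECT DATALENGTH(N'A') AS BmpBytes,            -- 2 bytes: UCS-2 and UTF-16 agree inside the BMP
       DATALENGTH(N'😀') AS SupplementaryBytes; -- 4 bytes: stored as a UTF-16 surrogate pair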

Two UDFs to deal with UTF-8 in T-SQL:
CREATE FUNCTION UcsToUtf8(@src nvarchar(MAX)) RETURNS varchar(MAX) AS
BEGIN
    -- Pattern matching the first character outside the ASCII range 0-127
    DECLARE @res varchar(MAX) = '', @pi char(8) = '%[^' + char(0) + '-' + char(127) + ']%', @i int, @j int
    SELECT @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    WHILE @i > 0
    BEGIN
        SELECT @j = unicode(substring(@src, @i, 1))
        IF @j < 0x800 -- encode as a two-byte UTF-8 sequence
            SELECT @res = @res + left(@src, @i - 1) + char((@j & 1984) / 64 + 192) + char((@j & 63) + 128)
        ELSE -- encode as a three-byte UTF-8 sequence
            SELECT @res = @res + left(@src, @i - 1) + char((@j & 61440) / 4096 + 224) + char((@j & 4032) / 64 + 128) + char((@j & 63) + 128)
        SELECT @src = substring(@src, @i + 1, datalength(@src) - 1), @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    END
    SELECT @res = @res + @src
    RETURN @res
END
CREATE FUNCTION Utf8ToUcs(@src varchar(MAX)) RETURNS nvarchar(MAX) AS
BEGIN
    DECLARE @i int, @res nvarchar(MAX) = @src, @pi varchar(18)
    -- First decode three-byte sequences: lead byte 0xE0-0xEF ('à'-'ï'), continuations 0x80-0xBF ('€'-'¿')
    SELECT @pi = '%[à-ï][€-¿][€-¿]%', @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    WHILE @i > 0
        SELECT @res = stuff(@res, @i, 3, nchar(((ascii(substring(@src, @i, 1)) & 31) * 4096) + ((ascii(substring(@src, @i + 1, 1)) & 63) * 64) + (ascii(substring(@src, @i + 2, 1)) & 63))),
               @src = stuff(@src, @i, 3, '.'),
               @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    -- Then decode two-byte sequences: lead byte 0xC2-0xDF ('Â'-'ß')
    SELECT @pi = '%[Â-ß][€-¿]%', @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    WHILE @i > 0
        SELECT @res = stuff(@res, @i, 2, nchar(((ascii(substring(@src, @i, 1)) & 31) * 64) + (ascii(substring(@src, @i + 1, 1)) & 63))),
               @src = stuff(@src, @i, 2, '.'),
               @i = patindex(@pi, @src COLLATE Latin1_General_BIN)
    RETURN @res
END
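A quick round-trip sanity check (assuming the functions are created in dbo; note they only cover code points up to U+FFFF, i.e. one- to three-byte sequences):

SELECT dbo.UcsToUtf8(N'Grüße') AS Utf8Bytes,                -- varchar holding the raw UTF-8 byte sequence
       dbo.Utf8ToUcs(dbo.UcsToUtf8(N'Grüße')) AS RoundTrip; -- should return N'Grüße'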

Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT, and OPENROWSET.
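For example, importing a UTF-8 file on SQL Server 2016+ uses code page 65001 (the target table and file path here are hypothetical):

BULK INSERT dbo.MyTable
FROM 'C:\data\utf8file.csv'
WITH (CODEPAGE = '65001', FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');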
Addendum 2016-12-21: SQL Server 2016 SP1 now enables Unicode compression (and most other previously Enterprise-only features) for all editions of MS SQL, including Standard and Express. This is not the same as UTF-8 support, but it yields a similar benefit if the goal is disk-space reduction for Western alphabets.

Related

German Umlaut hash - SHA256 on SQL server

I am facing a problem when applying SHA256 hash to German Umlaut Characters.
--Without Umlaut
SELECT CONVERT(VARCHAR(MAX), HASHBYTES('SHA2_256','o'), 2) as HASH_ID
SQL Server output: 65C74C15A686187BB6BBF9958F494FC6B80068034A659A9AD44991B08C58F2D2
This matches the output from
https://www.pelock.com/products/hash-calculator
--With Umlaut
SELECT CONVERT(VARCHAR(MAX), HASHBYTES('SHA2_256','ö'), 2)
SQL Server output: B0B2988B6BBE724BACDA5E9E524736DE0BC7DAE41C46B4213C50E1D35D4E5F13
Output from pelock: 6DBD11FD012E225B28A5D94A9B432BC491344F3E92158661BE2AE5AE2B8B1AD8
I want the SQL Server output to match pelock's. I have tested outputs from other sources (Snowflake and Python), and all of them align with the output from pelock. Not sure why SQL Server is not giving the right result. Any help is much appreciated.
You have two issues:
The literal text itself is being reinterpreted, because you have the wrong database collation. You can use the N prefix to prevent that, but this leads to a second problem...
The value from pelock is UTF-8, but using N means it will be UTF-16 nvarchar.
So you need to use a UTF-8 binary collation, the N prefix and cast it back to varchar.
SELECT CONVERT(VARCHAR(MAX), HASHBYTES('SHA2_256',CAST(N'ö' COLLATE Latin1_General_100_BIN2_UTF8 AS varchar(100))), 2)
Result
6DBD11FD012E225B28A5D94A9B432BC491344F3E92158661BE2AE5AE2B8B1AD8
db<>fiddle
UTF-8 collations are only supported in SQL Server 2019 and later. In older versions you would need to find a different collation that deals with the characters you have. It may not be possible to find a collation that deals with all of your data.
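To see why the results differ, it helps to inspect the bytes each approach actually feeds to HASHBYTES. A sketch (SQL Server 2019+): UTF-8 encodes 'ö' as 0xC3B6, while UTF-16LE encodes it as 0xF600.

SELECT CAST(CAST(N'ö' COLLATE Latin1_General_100_BIN2_UTF8 AS varchar(10)) AS varbinary(10)) AS Utf8Bytes, -- 0xC3B6
       CAST(N'ö' AS varbinary(10)) AS Utf16Bytes;                                                          -- 0xF600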

How to ensure specific character encoding in Microsoft SQL Server?

What I need is to ensure that a string gets encoded in a known character encoding. So far, my research and testing with MS SQL Server has revealed that the documented encoding is 'UCS-2', however the actual encoding (on the server in question) is 'UCS-2LE'.
Which doesn't seem very reliable. What I would love is an ENCODE function as found in Perl, Node, or almost anything else, so that regardless of upgrades or settings changes, my hash function will be working on known input.
We can limit the hashing string to HEX, so at worst, we could manually map the 16 possible input characters to the proper bytes. Anyone have a recommendation on this?
Here's the Perl I'm using:
use Digest::SHA qw/sha256/;
use Encode qw/encode/;
$seed = 'DDFF5D36-F14D-495D-BAA6-3688786D6CFA';
$string = '123456789';
$target = '57392CD6A5192B6185C5999EB23D240BB7CEFD26E377D904F6FEF262ED176F97';
$encoded = encode('UCS-2LE', $seed.$string);
$sha256 = uc(unpack("H*", sha256($encoded)));
print "$target\n$sha256\n";
Which matches MS SQL:
HASHBYTES('SHA2_256', 'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789')
But what I really want is:
HASHBYTES('SHA2_256', ENCODE('UCS-2LE', 'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789'))
So that no matter what MS SQL happens to be encoding the input string as, the HASHBYTES will always operate on a known byte array.
SQL Server uses UCS-2 only for columns, variables, and literals that are declared as nvarchar. In all other cases it uses an 8-bit character set with the code page of the current database, unless specified otherwise (using the collate clause, for example).
So, you either have to specify a Unicode literal:
select HASHBYTES('SHA2_256', N'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789');
Or, you can use a variable or table column of the nvarchar data type:
-- Variable
declare @var nvarchar(128) = N'DDFF5D36-F14D-495D-BAA6-3688786D6CFA123456789';
select HASHBYTES('SHA2_256', @var);

-- Table column
declare @t table(
    Value nvarchar(128)
);
insert into @t
select @var;

select HASHBYTES('SHA2_256', t.Value)
from @t t;
P.S. Of course, since Wintel is a little-endian platform, SQL Server uses the same byte order as the OS and hardware. Unless something new comes out in SQL Server 2017, there is no native way to get a big-endian representation.
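A quick way to confirm the little-endian UTF-16 byte layout is to cast a known character to varbinary:

SELECT CAST(N'A' AS varbinary(2)) AS Utf16LeBytes; -- 0x4100: U+0041 with the low byte first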

Missing nvarchar columns when reading SQL Server database table from Oracle

I have a SQL Server database with a table that has a column of the nvarchar(4000) data type. When I try to read the data from Oracle through a dblink, I don't see the nvarchar(4000) column. All the other columns' data is displayed properly.
Can anyone help me find the issue here and how to fix it?
Appendix A-1 (ODBC to Oracle data type mapping):

ODBC              Oracle    Comment
SQL_WCHAR         NCHAR     -
SQL_WVARCHAR      NVARCHAR  -
SQL_WLONGVARCHAR  LONG      Only if the Oracle DB character set is Unicode; otherwise not supported
Commonly nvarchar(max) is mapped to SQL_WLONGVARCHAR, and this data type can only be mapped to Oracle if the Oracle database character set is Unicode.
To check the database character set, please execute:
select * from nls_parameters;
and have a look at: NLS_CHARACTERSET
UPDATE
NLS_CHARACTERSET needs to be a Unicode character set, for example AL32UTF8. (Do this only if you know what you are doing, or ask your DBA to do it.)
The NCHAR character set isn't used, as the mapping is to Oracle LONG, which uses the normal database character set.
A second solution would be to create, on the SQL Server side, a view that splits the nvarchar(max) into several nvarchar(xxx) columns, and then to select from the view and concatenate the content again in Oracle, as sketched below. (If changing the character set to Unicode is a problem for you, this approach is the best way to go.)
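A minimal sketch of that view (all table, column, and view names are hypothetical):

-- On the SQL Server side: split the long column into dblink-safe chunks
CREATE VIEW dbo.BigTextSplit AS
SELECT Id,
       CAST(SUBSTRING(BigText, 1,    4000) AS nvarchar(4000)) AS Part1,
       CAST(SUBSTRING(BigText, 4001, 4000) AS nvarchar(4000)) AS Part2
FROM dbo.SourceTable;

-- On the Oracle side, reassemble the value, e.g.:
-- SELECT "Id", "Part1" || "Part2" FROM BigTextSplit@mssql_link;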

SQL Server Text Datatype Maxlength = 65,535?

Software I'm working with uses a text field to store XML. From my searches online, the text datatype is supposed to hold 2^31 - 1 characters. Currently SQL Server is truncating the XML at 65,535 characters every time. I know this is caused by SQL Server, because if I add a 65,536th character to the column directly in Management Studio, it states that it will not update because characters will be truncated.
Is the max length really 65,535 or could this be because the database was designed in an earlier version of SQL Server (2000) and it's using the legacy text datatype instead of 2005's?
If this is the case, will altering the datatype to Text in SQL Server 2005 fix this issue?
That is a limitation of SSMS, not of the text field; but you should use varchar(max) anyway, since text is deprecated.
Here is also a quick test
create table TestLen (bla text)
insert TestLen values (replicate(convert(varchar(max),'a'), 100000))
select datalength(bla)
from TestLen
Returns 100000 for me
MSSQL 2000 should allow up to 2^31 - 1 characters (non-Unicode) in a text field, which is over 2 billion. I don't know what's causing this limitation, but you might want to try varchar(max) or nvarchar(max) instead. These store just as many characters but also allow the regular string T-SQL functions (like LEN, SUBSTRING, REPLACE, RTRIM, ...).
If you're able to convert the column, you might as well, since the text data type will be removed in a future version of SQL Server. See here.
The recommendation is to use varchar(MAX) or nvarchar(MAX). In your case, you could also use the XML data type, but that may tie you to certain database engines (if that's a consideration).
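If you do convert, the change itself is a one-line ALTER (table and column names hypothetical; test on a copy first):

ALTER TABLE dbo.MyTable ALTER COLUMN XmlData varchar(max);
-- Optionally rewrite the values afterwards so existing data moves in-row:
-- UPDATE dbo.MyTable SET XmlData = XmlData;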
You should have a look at
XML Support in Microsoft SQL Server 2005
Beginning SQL Server 2005 XML Programming
So I would rather use the data type appropriate for the job than make a data type from a previous version fit your use.
Here's a little script I wrote for getting all the data out in chunks:
DECLARE @data NVARCHAR(MAX) = N'huge data';
DECLARE @readSentence NVARCHAR(MAX) = N'';
DECLARE @dataLength INT = (SELECT LEN(@data));
DECLARE @currIndex INT = 0;
WHILE @data <> @readSentence
BEGIN
    DECLARE @temp NVARCHAR(MAX) = N'';
    SET @temp = (SELECT SUBSTRING(@data, @currIndex, 65535));
    SELECT @temp;
    SET @readSentence += @temp;
    SET @currIndex += 65535;
END;

How do I set the database default encoding?

How do I set the default encoding of my local file-based SQL Server database?
Thanks.
EDIT: Removed the UTF-8
Assuming by encoding you mean collation, you can change the default for new databases like:
alter database model collate SQL_Latin1_General_CP1_CI_AS
To change the collation of an existing database:
alter database YourDbName collate SQL_Latin1_General_CP1_CI_AS
The list of available collations is returned by a system function:
select * from fn_helpcollations()
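For example, to narrow that list down to one language family:

select name, description
from fn_helpcollations()
where name like 'Latin1_General%';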
SQL Server does not support UTF-8.
Use nvarchar (UTF-16) if you don't mind using double the space, or varbinary if you don't need sorting/indexing/comparison.
It may be supported in the next release of SQL Server, according to this post.
/*!40101 SET NAMES utf8 */
Something like that
