Convert utf-8 encoded varbinary(max) data to nvarchar(max) string - sql-server

Is there a simple way to convert a UTF-8 encoded varbinary(max) column to varchar(max) in T-SQL? Something like CONVERT(varchar(max), [MyDataColumn]). The best would be a solution that does not need custom functions.
Currently, I convert the data on the client side, but the downside is that filtering and sorting there are less efficient than doing them server-side.

XML trick
The following solution should work for any encoding.
There is a tricky way of doing exactly what the OP asks. Edit: I found the same method discussed on SO (SQL - UTF-8 to varchar/nvarchar Encoding issue)
The process goes like this:
SELECT
CAST(
'<?xml version=''1.0'' encoding=''utf-8''?><![CDATA[' --start CDATA
+ REPLACE(
LB.LongBinary,
']]>', --we need only to escape ]]>, which ends CDATA section
']]]]><![CDATA[>' --we simply split it into two CDATA sections
) + ']]>' AS XML --finish CDATA
).value('.', 'nvarchar(max)')
FROM
#longBinaries AS LB; --any row source with a varbinary column; this one comes from the test below
Why it works: varbinary and varchar are the same string of bits; only the interpretation differs. The resulting XML therefore truly is a UTF-8 encoded byte stream, and the XML parser is then able to reconstruct the correct UTF-8 encoded characters.
BEWARE the 'nvarchar(max)' in the value function. If you used varchar, it would destroy multi-byte characters (depending on your collation).
BEWARE 2: XML cannot handle some characters, e.g. 0x02. When your string contains such characters, this trick will fail.
Database trick (SQL Server 2019 and newer)
This is simple. Create another database with UTF8 collation as the default one. Create function that converts VARBINARY to VARCHAR. The returned VARCHAR will have that UTF8 collation of the database.
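A minimal sketch of the database trick might look like this. The names FUNCTIONS_ONLY and dbo.VarBinaryToUTF8 are borrowed from the test script further down, where the function is called but never defined, so the body here is an assumption rather than the original definition. The parameter name @Bin is made up for illustration.

```sql
-- Assumption: SQL Server 2019+ and permission to create a database.
CREATE DATABASE FUNCTIONS_ONLY COLLATE Czech_100_CI_AI_SC_UTF8;
GO
USE FUNCTIONS_ONLY;
GO
-- VARCHAR without an explicit COLLATE takes the database default (UTF8) collation,
-- so the bytes are reinterpreted as UTF-8 rather than as a single-byte code page.
CREATE FUNCTION dbo.VarBinaryToUTF8 (@Bin VARBINARY(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    RETURN CAST(@Bin AS VARCHAR(MAX));
END;
GO
-- Usage from any other database; the literal is N'Žluťouč' encoded as UTF-8.
SELECT FUNCTIONS_ONLY.dbo.VarBinaryToUTF8(0xC5BD6C75C5A56F75C48D) AS Converted;
```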
Insert trick (SQL Server 2019 and newer)
This is another simple trick. Create a table with one column VARCHAR COLLATE ...UTF8. Insert the VARBINARY data into this table. It will get saved correctly as UTF8 VARCHAR. It is sad that memory optimized tables cannot use UTF8 collations...
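As a rough illustration of the insert trick, assuming SQL Server 2019 or newer (the table and column names below are hypothetical):

```sql
DROP TABLE IF EXISTS #Utf8Target;
CREATE TABLE #Utf8Target (Txt VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8);

-- The binary value is implicitly converted to the column type,
-- so the bytes are stored as-is and read back as UTF-8.
INSERT INTO #Utf8Target (Txt)
VALUES (0xC5BD6C75C5A56F75C48D);  -- N'Žluťouč' encoded as UTF-8

SELECT
    Txt,                                    -- UTF-8 varchar
    CAST(Txt AS NVARCHAR(MAX)) AS TxtWide   -- nvarchar, if that is what you need
FROM #Utf8Target;
```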
Alter table trick (SQL Server 2019 and newer)
(don't use this, it is unnecessary, see the Insert trick)
I was trying to come up with an approach using SQL Server 2019's Utf8 collation and I have found one possible method so far, that should be faster than the XML trick (see below).
Create temporary table with varbinary column.
Insert varbinary values into the table
Alter table alter column to varchar with utf8 collation
drop table if exists
#bin,
#utf8;
create table #utf8 (UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8);
create table #bin (BIN VARBINARY(MAX));
insert into #utf8 (UTF8) values ('Žluťoučký kůň říčně pěl ďábelské ódy za svitu měsíce.');
insert into #bin (BIN) select CAST(UTF8 AS varbinary(max)) from #utf8;
select * from #utf8; --here you can see the utf8 string is stored correctly and that
select BIN, CAST(BIN AS VARCHAR(MAX)) from #bin; --utf8 binary is converted into gibberish
alter table #bin alter column BIN varchar(max) collate Czech_100_CI_AI_SC_UTF8;
select * from #bin; --voilà, correctly converted varchar
alter table #bin alter column BIN nvarchar(max);
select * from #bin; --finally, correctly converted nvarchar
Speed difference
The Database trick together with the Insert trick are the fastest ones.
The XML trick is slower.
The Alter table trick is stupid, don't do it. It loses out when you change lots of short texts at once (the altered table is large).
The test:
first string contains one replace for the XML trick
second string is plain ASCII with no replaces for XML trick
@TextLengthMultiplier determines length of the converted text
@TextAmount determines how many of them at once will be converted
------------------
--TEST SETUP
--DECLARE @LongText NVARCHAR(MAX) = N'český jazyk, Tiếng Việt, русский язык, 漢語, ]]>';
--DECLARE @LongText NVARCHAR(MAX) = N'JUST ASCII, for LOLZ------------------------------------------------------';
DECLARE
@TextLengthMultiplier INTEGER = 100000,
@TextAmount INTEGER = 10;
---------------------
-- TECHNICALITIES
DECLARE
@StartCDATA DATETIME2(7), @EndCDATA DATETIME2(7),
@StartTable DATETIME2(7), @EndTable DATETIME2(7),
@StartDB DATETIME2(7), @EndDB DATETIME2(7),
@StartInsert DATETIME2(7), @EndInsert DATETIME2(7);
drop table if exists
#longTexts,
#longBinaries,
#CDATAConverts,
#DBConverts,
#INsertConverts;
CREATE TABLE #longTexts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #longBinaries (LongBinary VARBINARY(MAX) NOT NULL);
CREATE TABLE #CDATAConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #DBConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #InsertConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
insert into #longTexts --make the long text longer
(LongText)
select
REPLICATE(@LongText, @TextLengthMultiplier)
from
TESTES.dbo.Numbers --use while if you don't have number table
WHERE
Number BETWEEN 1 AND @TextAmount; --make more of them
insert into #longBinaries (LongBinary) select CAST(LongText AS varbinary(max)) from #longTexts;
--sanity check...
SELECT TOP(1) * FROM #longTexts;
------------------------------
--MEASURE CDATA--
SET @StartCDATA = SYSDATETIME();
INSERT INTO #CDATAConverts
(
LongText
)
SELECT
CAST(
'<?xml version=''1.0'' encoding=''utf-8''?><![CDATA['
+ REPLACE(
LB.LongBinary,
']]>',
']]]]><![CDATA[>'
) + ']]>' AS XML
).value('.', 'Nvarchar(max)')
FROM
#longBinaries AS LB;
SET @EndCDATA = SYSDATETIME();
--------------------------------------------
--MEASURE ALTER TABLE--
SET @StartTable = SYSDATETIME();
DROP TABLE IF EXISTS #AlterConverts;
CREATE TABLE #AlterConverts (UTF8 VARBINARY(MAX));
INSERT INTO #AlterConverts
(
UTF8
)
SELECT
LB.LongBinary
FROM
#longBinaries AS LB;
ALTER TABLE #AlterConverts ALTER COLUMN UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8;
--ALTER TABLE #AlterConverts ALTER COLUMN UTF8 NVARCHAR(MAX);
SET @EndTable = SYSDATETIME();
--------------------------------------------
--MEASURE DB--
SET @StartDB = SYSDATETIME();
INSERT INTO #DBConverts
(
LongText
)
SELECT
FUNCTIONS_ONLY.dbo.VarBinaryToUTF8(LB.LongBinary)
FROM
#longBinaries AS LB;
SET @EndDB = SYSDATETIME();
--------------------------------------------
--MEASURE Insert--
SET @StartInsert = SYSDATETIME();
INSERT INTO #INsertConverts
(
LongText
)
SELECT
LB.LongBinary
FROM
#longBinaries AS LB;
SET @EndInsert = SYSDATETIME();
--------------------------------------------
-- RESULTS
SELECT
DATEDIFF(MILLISECOND, @StartCDATA, @EndCDATA) AS CDATA_MS,
DATEDIFF(MILLISECOND, @StartTable, @EndTable) AS ALTER_MS,
DATEDIFF(MILLISECOND, @StartDB, @EndDB) AS DB_MS,
DATEDIFF(MILLISECOND, @StartInsert, @EndInsert) AS Insert_MS;
SELECT TOP(1) '#CDATAConverts ', * FROM #CDATAConverts ;
SELECT TOP(1) '#DBConverts ', * FROM #DBConverts ;
SELECT TOP(1) '#INsertConverts', * FROM #INsertConverts;
SELECT TOP(1) '#AlterConverts ', * FROM #AlterConverts ;

SQL Server does not know UTF-8 (at least in all versions you can use productively). There is limited support starting with v2014 SP2 (and some details about the supported versions) when reading a UTF-8 encoded file from disk via BCP (the same goes for writing content to disk).
Important to know:
VARCHAR(x) is not UTF-8. It is 1-byte-encoded extended ASCII, using a code page (defined by the collation) as the character map.
NVARCHAR(x) is not UTF-16 (but very close to it; it's UCS-2). This is a 2-byte-encoded string covering almost all known characters (but exceptions exist).
UTF-8 uses 1 byte for plain Latin characters, but 2 or more bytes to encode other characters.
A VARBINARY(x) will hold the UTF-8 as a meaningless chain of bytes.
A simple CAST or CONVERT will not work: VARCHAR will take each single byte as a character, which is certainly not the result you would expect. NVARCHAR would take each chunk of 2 bytes as one character. Again, not what you need.
You might try to write this out to a file and read it back with BCP (v2014 SP2 or higher). But the better option I see for you is a CLR function.
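To see the problem described above, a small sketch like the following can help. It assumes the current database uses a non-UTF8 default collation; the variable name and the sample bytes are made up for illustration.

```sql
-- UTF-8 bytes of N'Žluťouč' (10 bytes for 7 characters)
DECLARE @Utf8Bytes VARBINARY(MAX) = 0xC5BD6C75C5A56F75C48D;

SELECT
    CAST(@Utf8Bytes AS VARCHAR(MAX))  AS AsVarchar,   -- every byte becomes one code-page character
    CAST(@Utf8Bytes AS NVARCHAR(MAX)) AS AsNvarchar,  -- every 2-byte chunk becomes one character
    DATALENGTH(@Utf8Bytes)            AS ByteCount;
```

Neither column should come back as the original text, which is the point the answer makes.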

You can use the following to store a string in a varbinary field:
Encoding.Unicode.GetBytes(Item.VALUE)
Then use the following to retrieve the data as a string:
public string ReadCString(byte[] cString)
{
// Decode the whole buffer as UTF-16LE. The original computed the index of the first
// zero byte but never used it; a single-byte search is not a valid terminator check
// for UTF-16 anyway, because ASCII characters carry a zero high byte.
return System.Text.Encoding.Unicode.GetString(cString);
}

Related

How to split a varbinary(max) into a list of ints? (and the other way)

I'm currently storing a list of ids in a column as a CSV string value ('1;2;3').
I'd like to optimize with what I believe is a better approach, which would use varbinary(max).
I'm looking for T-SQL functions:
1. That would merge a set of integer rows side by side into a varbinary(max)
2. That would split the varbinary field into a set of int rows
Any tips appreciated, thanks
The solution is very questionable. I'd also suggest normalizing the data.
However, if you still want to store your data as VARBINARY, here is the solution:
CREATE FUNCTION dbo.fn_String_to_Varbinary(@Input VARCHAR(MAX))
RETURNS VARBINARY(MAX) AS
BEGIN
DECLARE @VB VARBINARY(MAX);
WITH CTE as (
SELECT CAST(CAST(LEFT(IT,CHARINDEX(';',IT)-1) AS INT) as VARBINARY(MAX)) as VB, RIGHT(IT,LEN(IT) - CHARINDEX(';',IT)) AS IT
FROM (VALUES (@Input)) as X(IT) union all
SELECT VB + CAST(CAST(LEFT(IT,CHARINDEX(';',IT)-1) AS INT) as VARBINARY(MAX)) as VB, RIGHT(IT,LEN(IT) - CHARINDEX(';',IT)) AS IT FROM CTE WHERE LEN(IT) > 1
)
SELECT TOP 1 @VB = VB FROM CTE
ORDER BY LEN(VB) DESC
RETURN @VB
END
GO
DECLARE @Input VARCHAR(MAX) = '421;1;2;3;5000;576;842;375;34654322;18;67;000001;1232142334;'
DECLARE @Position INT = 9
DECLARE @VB VARBINARY(MAX)
SELECT @VB = dbo.fn_String_to_Varbinary(@Input)
SELECT @VB, CAST(SUBSTRING(@VB,4*(@Position-1)+1,4) AS INT)
GO
The function converts the string into VARBINARY, and then the script extracts the 9th number from that VARBINARY value.
Do not run this function against a data set with a million records and a million numbers in each line.
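The script above extracts a single number by position. For the other direction the question asks about (splitting the varbinary back into a set of int rows), one possible sketch is a recursive CTE over the 4-byte slots. The CTE name Positions and the sample value are assumptions made for illustration.

```sql
-- 421, 1 and 2 packed as 4-byte big-endian integers, the same layout the function produces
DECLARE @VB VARBINARY(MAX) = 0x000001A50000000100000002;

WITH Positions AS (
    SELECT 1 AS Pos
    UNION ALL
    SELECT Pos + 1 FROM Positions WHERE Pos < DATALENGTH(@VB) / 4
)
SELECT
    Pos,
    CAST(SUBSTRING(@VB, 4 * (Pos - 1) + 1, 4) AS INT) AS Value  -- same slicing as in the script above
FROM Positions
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels
```

A numbers table would likely scale better than recursion for long lists, but the slicing expression stays the same.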

how to read varbinary data in xml in sql server

I have data in XML like this:
<Values>
<Id>7f8a5d20-d171-42f5-a222-a01b5186a048</Id>
<DealAttachment>
<AttachmentId>deefff3f-f63e-4b4c-8e76-68b6db476628</AttachmentId>
<IsNew>true</IsNew>
<IsDeleted>false</IsDeleted>
<DealId>7f8a5d20-d171-42f5-a222-a01b5186a048</DealId>
<AttachmentName>changes2</AttachmentName>
<AttachmentFile>991049711010310132116104101321011099710510832115117981061019911645100111110101131011711210097116101329811110012132116101120116321151171031031011151161011003298121326610510810845100111110101131010011132110111116321151011101003210110997105108321051103299971151013211110232117112100971161011003299108105101110116115329811711632115101110100321051103299971151013211110232100101108101116101100329910810510111011611545321001111101011310991141019711610132973211510111297114971161013211697981081013211611132991111101169710511032117115101114115321161113211910411110932101109971051083211910510810832981013211510111011645100111110101131011611411711099971161013211610410132108111103321169798108101329710211610111432111110101321091111101161044510011111010113108411497110115108971161051111101154532100111110101131097100100321129711497109101116101114321161113210611198321161113211411711032115105108101110116108121451001011089712110110013101310131013101310131010511010211111410932981051081083210511032999711510132981111161043211711210097116105111110471001011081011161051111104432117112100971161051111103211910510810832110111116329810132105110102111114109101100</AttachmentFile>
<AttachmentType>.txt</AttachmentType>
</DealAttachment>
</Values>
where AttachmentFile is varbinary(max)
DECLARE @AttachmentId uniqueidentifier,
@DealId uniqueidentifier,
@IsNew bit,
@IsDeleted bit,
@AttachmentName varchar(100),
@AttachmentFile varbinary(max),
@AttachmentType varchar(50)
SET @DealId = @SubmitXml.value('(Values/Id/node())[1]', 'uniqueidentifier')
SET @AttachmentId = @SubmitXml.value('(Values/DealAttachment/AttachmentId/node())[1]', 'uniqueidentifier')
SET @IsNew = @SubmitXml.value('(Values/DealAttachment/IsNew/node())[1]', 'bit')
SET @IsDeleted = @SubmitXml.value('(Values/DealAttachment/IsDeleted/node())[1]', 'bit')
SET @AttachmentName = @SubmitXml.value('(Values/DealAttachment/AttachmentName/node())[1]', 'varchar(100)')
SET @AttachmentFile = @SubmitXml.value('(Values/DealAttachment/AttachmentFile/node())[1]', 'varbinary(max)')
SET @AttachmentType = @SubmitXml.value('(Values/DealAttachment/AttachmentType/node())[1]', 'varchar(50)')
But after the above statements, @AttachmentFile is NULL or blank.
Binary data types in SQL Server (including varbinary) are represented as hexadecimal in queries which read and write them.
I think the problem here is that, rather than writing to the database directly from the byte stream (as in the example you linked to in your comment, which would implicitly cast the byte array to a hexadecimal value), it's being written to an intermediate XML block. When the data is written to the XML block, it looks like the byte stream is being converted to a string made up of concatenated list of integers of the byte values in decimal. Because the byte values are not delimited, it might not be possible to reconstruct the original data from this stream.
If you need to use an intermediate XML file, a more conventional approach would be to encode the file data as Base64 in the XML block (as discussed in this question - and doubtless many others). You could then decode it using the xs:base64Binary function:
SET @AttachmentFile = @SubmitXml.value('xs:base64Binary((Values/DealAttachment/AttachmentFile/node())[1])','varbinary(max)')
@Ed is correct in that you have somehow stored the decimal ASCII values for each character instead of the hex value for each. You can see that by decoding each one:
SELECT CHAR(99) + CHAR(104) + CHAR(97) + CHAR(110) + CHAR(103) + CHAR(101) +
CHAR(32) + CHAR(116) + CHAR(104) + CHAR(101);
-- change the
But as you can also see, there is no way to decode that programmatically because it is a mix of 2 digit and 3 digit values.
If you really were storing hex bytes in that XML element, you could turn it into VARBINARY(MAX) having the same binary bytes by first extracting it from the XML as a plain VARCHAR(MAX) and then converting it to VARBINARY(MAX) using the CONVERT built-in function, specifying a "style" of 2.
SELECT CONVERT(VARBINARY(MAX), 'this is a test, yo!');
-- 0x74686973206973206120746573742C20796F21
DECLARE @SomeXML XML = N'
<Values>
<DealAttachment>
<AttachmentFile>74686973206973206120746573742C20796F21</AttachmentFile>
</DealAttachment>
</Values>';
SELECT CONVERT(VARBINARY(MAX),
@SomeXML.value('(/Values/DealAttachment/AttachmentFile/text())[1]',
'VARCHAR(MAX)'),
2)
-- 0x74686973206973206120746573742C20796F21 (but as VARBINARY)
However, that all being said, ideally you would just Base64 encode the binary file on the way in (also as mentioned by @Ed) using Convert.ToBase64String (since you already have a byte[]).
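To illustrate the Base64 route on the SQL Server side, here is a small round-trip sketch. The element names mirror the question; the variable names and the sample payload are assumptions.

```sql
DECLARE @Bytes VARBINARY(MAX) = CONVERT(VARBINARY(MAX), 'this is a test, yo!');

-- Build the XML with the binary value Base64 encoded
DECLARE @Xml XML = (
    SELECT @Bytes AS AttachmentFile
    FOR XML PATH('DealAttachment'), BINARY BASE64, TYPE, ROOT('Values')
);

-- Reading the element back as varbinary(max) decodes the Base64 text
DECLARE @Decoded VARBINARY(MAX) =
    @Xml.value('(/Values/DealAttachment/AttachmentFile/text())[1]', 'varbinary(max)');

SELECT
    @Xml AS EncodedXml,
    @Decoded AS Decoded,
    CASE WHEN @Decoded = @Bytes THEN 'round trip ok' ELSE 'mismatch' END AS RoundTripCheck;
```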

Detect UNICODE characters that are not ASCII in table

I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search all this table for any characters that are UNICODE but not ASCII. Is this possible?
You can find non-ASCII characters quite simply:
SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))
If you want to determine if there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test. The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?. But, non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping. For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ² and so it won't register as a row that can't convert to VARCHAR. For example:
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2
Ideally you would convert back to NVARCHAR explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR, so the behavior is the same either way.
Please note that only MAX types are used. Do not use NVARCHAR(4000) or VARCHAR(4000) else you might get false positives due to truncation of data in NVARCHAR(MAX) columns.
So, in terms of the example code in the question, the query would be (assuming that a Latin1_General collation is being used):
SELECT usr.*
FROM dbo.[users] usr
WHERE usr.[name] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR usr.[address] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR usr.[description] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));
There doesn't seem to be an inbuilt function for this as far as I can tell. A brute force approach is to pass each character to ascii and then pass the result to char and check if it returns '?', which would mean the character is out of range. You can write a UDF with the below code as reference, but I should think that it is a very inefficient solution:
declare @i int = 1
declare @x nvarchar(10) = N'vsdǣf'
declare @result nvarchar(100) = N''
while (@i <= len(@x)) -- <= so that the last character is checked as well
begin
if char(ascii(substring(@x,@i,1))) = '?'
begin
set @result = @result + substring(@x,@i,1)
end
set @i = @i+1
end
select @result

Short guid in SQL Server / converting from a character string to uniqueidentifier

I need to create a column which will contain a short guid. So I found something like this:
alter table [dbo].[Table]
add InC UNIQUEIDENTIFIER not null default LEFT(NEWID(),6)
But I get the error:
Conversion failed when converting from a character string to uniqueidentifier.
I've been trying
LEFT(CONVERT(varchar(36),NEWID()),6)
and
CONVERT(UNIQUEIDENTIFIER,LEFT(CONVERT(varchar(36),NEWID()),6))
But I am still getting the same error.
There is no such thing as "short guid". Guid, or uniqueidentifier is a 16 byte data type. You can read about it in MSDN. It means that the length must always be 16 bytes and you cannot use 6 characters as you are trying to do.
In the same MSDN article you can find description how you can initialize this type:
A column or local variable of uniqueidentifier data type can be
initialized to a value in the following ways:
By using the NEWID function.
By converting from a string constant in the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, in which each x is a
hexadecimal digit in the range 0-9 or a-f. For example,
6F9619FF-8B86-D011-B42D-00C04FC964FF is a valid uniqueidentifier
value.
In your case you are trying to convert only 6 characters to uniqueidentifier which obviously fails.
If you want to use just 6 characters, just use varchar(6):
alter table [dbo].[Table]
add InC varchar(6) not null default LEFT(NEWID(),6)
Keep in mind that in this case this guid is not guaranteed to be unique.
Using CRYPT_GEN_RANDOM instead of NEWID can improve random distribution of the string.
SELECT LEFT(CAST(CAST(CRYPT_GEN_RANDOM(16) AS UNIQUEIDENTIFIER) AS VARCHAR(50)), 6)
I just made this one since I couldn't find a good answer on the internet.
Please keep in mind this is a 64-bit representation of a 128-bit value, so collisions are much more likely than with a real GUID. Does not handle 0.
Function takes a NEWID value: 6A10A273-4561-40D8-8D36-4D3B37E4A19C
and shortens it to : 7341xIlZseT
DECLARE @myid uniqueidentifier= NEWID()
select @myid
-- convert the GUID selected above (the original called NEWID() a second time here)
DECLARE @bigintdata BIGINT = cast(cast(reverse(@myid) as varbinary(max)) as bigint)
DECLARE @charSet VARCHAR(70) = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
DECLARE @cBase int = LEN(@charSet)
DECLARE @sUID varchar(22) = ''
DECLARE @x int
WHILE (@bigintdata <> 0)
BEGIN
SET @x = CAST(@bigintdata % @cBase as INT) + 1
SET @bigintdata = @bigintdata / @cBase
SET @sUID = SUBSTRING(@charSet, @x, 1) + @sUID;
END
SELECT @sUID

How VARCHAR/CHAR manages to store/render multinational symbols in SQL Server?

I have read that varchar (char) is used for storing ASCII characters with 1 byte per character, while nvarchar (nchar) uses Unicode with 2 bytes.
But which ASCII? In SSMS 2008 R2
DECLARE @temp VARCHAR(3); --CHAR(3)
SET @temp = 'ЮЯç'; --Cyrillic + Portuguese-specific letters
select @temp,datalength(@temp)
-- results in
-- ЮЯç 3
Update: Oops, the result was really ЮЯс and not ЮЯç. Thanks, Martin
declare @table table
(
c1 char(4) collate Cyrillic_General_CS_AI,
c2 char(4) collate Latin1_General_100_CS_AS_WS
)
INSERT INTO @table VALUES (N'ЮЯçæ', N'ЮЯçæ')
SELECT c1,cast(c1 as binary(4)) as c1bin, c2, cast(c2 as binary(4)) as c2bin
FROM @table
Returns
c1 c1bin c2 c2bin
---- ---------- ---- ----------
ЮЯc? 0xDEDF633F ??çæ 0x3F3FE7E6
You can see that, depending on the collation, non-ASCII characters can get lost or silently converted to near equivalents.
It's ASCII with a codepage which defines the upper 128 characters (128-255). This is controlled by the "collation" in SQL Server, and depending on the collation you use you can use a subset of "special" characters.
See this MSDN page.
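A quick way to watch the code page at work is to store the same byte in two char columns with different collations. This is only a sketch: the table variable and column names are made up, and it relies on a binary value being stored byte for byte when it is implicitly converted to the column type.

```sql
DECLARE @CodePages TABLE
(
    Cyr CHAR(1) COLLATE Cyrillic_General_CS_AI,       -- code page 1251
    Lat CHAR(1) COLLATE Latin1_General_100_CS_AS_WS   -- code page 1252
);

-- The same single byte goes into both columns
INSERT INTO @CodePages (Cyr, Lat) VALUES (0xDE, 0xDE);

SELECT
    Cyr, CAST(Cyr AS BINARY(1)) AS CyrBin,
    Lat, CAST(Lat AS BINARY(1)) AS LatBin
FROM @CodePages;
```

Both binary columns should show the same byte, while the letter displayed for each column depends on the code page of its collation.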

Resources