Convert UTF-8 varbinary(max) to varchar(max) - sql-server

I have a varbinary(max) column with UTF-8-encoded text that has been compressed. I would like to decompress this data and work with it in T-SQL as a varchar(max) using the UTF-8 capabilities of SQL Server.
I'm looking for a way of specifying the encoding when converting from varbinary(max) to varchar(max). The only way I've managed to do that is by creating a table variable with a column with a UTF-8 collation and inserting the varbinary data into it.
DECLARE @rv TABLE(
Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO @rv
SELECT SUBSTRING(Decompressed, 4, DATALENGTH(Decompressed) - 3) WithoutBOM
FROM
(SELECT DECOMPRESS(RawResource) AS Decompressed FROM Resource) t
I'm wondering if there is a more elegant and efficient approach that does not involve inserting into a table variable.
UPDATE:
Boiling this down to a simple example that doesn't deal with byte order marks or compression:
I have the string "Hello 😊" UTF-8 encoded without a BOM stored in variable @utf8Binary
DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A
Now I try to assign that into various char-based variables and print the result:
DECLARE @brokenVarChar varchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenVarChar = ' + @brokenVarChar
DECLARE @brokenNVarChar nvarchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenNVarChar = ' + @brokenNVarChar
DECLARE @rv TABLE(
Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO @rv
select @utf8Binary
DECLARE @working nvarchar(max)
Select TOP 1 @working = Res from @rv
print '@working = ' + @working
The results of this are:
@brokenVarChar = Hello 😊
@brokenNVarChar = Hello 😊
@working = Hello 😊
So I am able to get the binary result properly decoded using this indirect method, but I am wondering if there is a more straightforward (and likely efficient) approach.

I don't like this solution, but it's one I got to (I initially thought it wasn't working, due to what appears to be a bug in ADS). One method would be to create a new database in a UTF8 collation, and then pass the value to a function in that database. As the database is in a UTF8 collation, the default collation will be different to the local one, and the correct result will be returned:
CREATE DATABASE UTF8 COLLATE Latin1_General_100_CI_AS_SC_UTF8;
GO
USE UTF8;
GO
CREATE OR ALTER FUNCTION dbo.Bin2UTF8 (@utfbinary varbinary(MAX))
RETURNS varchar(MAX) AS
BEGIN
RETURN CAST(@utfbinary AS varchar(MAX));
END
GO
USE YourDatabase;
GO
SELECT UTF8.dbo.Bin2UTF8(0x48656C6C6F20F09F988A);
This, however, isn't particularly "pretty".

There is an undocumented hack:
DECLARE @utf8 VARBINARY(MAX)=0x48656C6C6F20F09F988A;
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utf8,']]>') AS XML)
.value('.','nvarchar(max)');
The result
Hello 😊
This works even in versions without the new UTF8 collations...
UPDATE: calling this as a function
This can easily be wrapped in a scalar function
CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
RETURN
(
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
.value('.','nvarchar(max)')
);
END
GO
Or like this as an inlined table valued function
CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS TABLE
AS
RETURN
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
.value('.','nvarchar(max)') AS ConvertedString
GO
This can be used after FROM or - more appropriate - with APPLY
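As a rough sketch of the APPLY usage, the query below assumes the inline table-valued version of dbo.Convert_UTF8_Binary_To_NVarchar has been created as shown above. The table variable @samples and its columns are made up for illustration; a real query would read from the actual source table instead.

```sql
-- Hypothetical sample data: one row holding the UTF-8 bytes of "Hello 😊".
DECLARE @samples TABLE (Id int, Utf8Bytes varbinary(max));
INSERT INTO @samples (Id, Utf8Bytes) VALUES (1, 0x48656C6C6F20F09F988A);

-- CROSS APPLY calls the inline function once per row and exposes its
-- single column, ConvertedString, next to the columns of the source row.
SELECT s.Id,
       c.ConvertedString
FROM @samples AS s
CROSS APPLY dbo.Convert_UTF8_Binary_To_NVarchar(s.Utf8Bytes) AS c;
```

Because the bytes are embedded in a CDATA section, input that contains the sequence ]]> or characters that are not allowed in XML would presumably make the cast fail, so this is best treated as a workaround for well-behaved text.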

DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A;
DECLARE @brokenNVarChar nvarchar(max) = concat(@utf8Binary, '' COLLATE Latin1_General_100_CI_AS_SC_UTF8);
print '@brokenNVarChar = ' + @brokenNVarChar;

You didn't say how your data is compressed or what compression algorithm was used. But if you are using the COMPRESS function in SQL Server 2016 or later, you can use the DECOMPRESS function and then cast it to a VARCHAR(MAX). Both COMPRESS and DECOMPRESS use the GZip compression algorithm.
This function will decompress an input expression value, using the GZIP algorithm. DECOMPRESS will return a byte array (VARBINARY(MAX) type).
CAST(DECOMPRESS([compressed content here]) AS VARCHAR(MAX))
See: COMPRESS (Transact-SQL) and DECOMPRESS (Transact-SQL)
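A minimal round-trip sketch, assuming the text was compressed with COMPRESS from a varchar value in the same database. The variable names are illustrative only.

```sql
-- Compress a varchar value, then decompress it and cast back to the same type.
DECLARE @original varchar(max) = 'Hello, world';
DECLARE @compressed varbinary(max) = COMPRESS(@original);

SELECT CAST(DECOMPRESS(@compressed) AS varchar(max)) AS RoundTripped;
```

The cast has to match the type that was originally compressed: data compressed from an nvarchar value should be cast back to nvarchar(max). Note that a plain cast to varchar(max) interprets the bytes using the code page of the database's default collation, so by itself it does not decode UTF-8 bytes unless that collation is a UTF-8 one.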

Related

TSQL "Illegal XML Character" When Converting Varbinary to XML

I'm trying to create a stored procedure in SQL Server 2016 that converts XML that was previously converted into Varbinary back into XML, but getting an "Illegal XML character" error when converting. I've found a workaround that seems to work, but I can't actually figure out why it works, which makes me uncomfortable.
The stored procedure takes data that was converted to binary in SSIS and inserted into a varbinary(MAX) column in a table and performs a simple
CAST(Column AS XML)
It worked fine for a long time, and I only began seeing an issue when the initial XML started containing an ® (registered trademark) symbol.
Now, when I attempt to convert the binary to XML I get this error
Msg 9420, Level 16, State 1, Line 23
XML parsing: line 1, character 7, illegal xml character
However, if I first convert the binary to varchar(MAX), then convert that to XML, it seems to work fine. I don't understand what is happening when I perform that intermediate CAST that is different than casting directly to XML. My main concern is that I don't want to add it in to account for this scenario and end up with unintended consequences.
Test code:
DECLARE @foo VARBINARY(MAX)
DECLARE @bar VARCHAR(MAX)
DECLARE @Nbar NVARCHAR(MAX)
--SELECT Varbinary
SET @foo = CAST( '<Test>®</Test>' AS VARBINARY(MAX))
SELECT @foo AsBinary
--select as binary as varchar
SET @bar = CAST(@foo AS VARCHAR(MAX))
SELECT @bar BinaryAsVarchar -- Correct string output
--select binary as nvarchar
SET @nbar = CAST(@foo AS NVARCHAR(MAX))
SELECT @nbar BinaryAsNvarchar -- Chinese characters
--select binary as XML
SELECT TRY_CAST(@foo AS XML) BinaryAsXML -- ILLEGAL XML character
-- SELECT CONVERT(xml, @obfoo) BinaryAsXML --ILLEGAL XML Character
--select BinaryAsVarcharAsXML
SELECT TRY_CAST(@bar AS XML) BinaryAsVarcharAsXML -- Correct Output
--select BinaryAsNVarcharAsXML
SELECT TRY_CAST(@nbar AS XML) BinaryAsNvarcharAsXML -- Chinese Characters
There are several things to know:
SQL-Server is rather limited with character encodings. There is VARCHAR, which is 1-byte-encoded extended ASCII and NVARCHAR, which is UCS-2 (almost the same as utf-16).
VARCHAR uses plain latin for the first set of characters and a codepage-mapping provided by the collation in use for the second set.
VARCHAR is not utf-8. utf-8 works with VARCHAR as long as all characters are 1-byte-encoded. But utf-8 has a lot of 2-byte-encoded (up to 4-byte-encoded) characters, which would break the internal storage of a VARCHAR string.
NVARCHAR will work with almost any 2-byte-encoded character natively (that means with almost any existing character). But it is not exactly utf-16 (there are characters that utf-16 encodes as a 4-byte surrogate pair, which would break SQL Server's internal storage).
XML is not stored as the XML string you see, but as a hierarchically organised physical table, based on NVARCHAR values.
The natively stored XML is really fast, while any text-based storage will need a very expensive parse-operation in advance (over and over...).
Storing XML as string is bad, storing XML as VARCHAR string is even worse.
Storing a VARCHAR-string-XML as VARBINARY is an accumulation of things you should not do.
Try this:
DECLARE @text1Byte VARCHAR(100)='<test>blah</test>';
DECLARE @text2Byte NVARCHAR(100)=N'<test>blah</test>';
SELECT CAST(@text1Byte AS VARBINARY(MAX)) AS text1Byte_Binary
,CAST(@text2Byte AS VARBINARY(MAX)) AS text2Byte_Binary
,CAST(@text1Byte AS XML) AS text1Byte_XML
,CAST(@text2Byte AS XML) AS text2Byte_XML
,CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS XML) AS text1Byte_XML_via_Binary
,CAST(CAST(@text2Byte AS VARBINARY(MAX)) AS XML) AS text2Byte_XML_via_Binary
The only difference you'll see are the many zeros in 0x3C0074006500730074003E0062006C00610068003C002F0074006500730074003E00. This is due to the 2-byte-encoding of nvarchar, each second byte is not needed in this sample. But if you'd need far-east-characters the picture would be completely different.
The reason why it works: SQL Server is very smart. The cast from the variable to XML is rather easy, as the engine knows that the underlying variable is varchar or nvarchar. But the last two casts are different: the engine has to examine the binary to see whether it is a valid nvarchar, and will give it a second try with varchar if that fails.
Now try to add your registered trademark to the given example. Add it first to the second variable, DECLARE @text2Byte NVARCHAR(100)=N'<test>blah®</test>';, and try to run this. Then add it to the first variable and try it again.
What you can try:
Cast your binary to varchar(max), then to nvarchar(max) and finally to xml.
,CAST(CAST(CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS VARCHAR(MAX)) AS NVARCHAR(MAX)) AS XML) AS text1Byte_XML_via_Binary
This will work, but it won't be fast...
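Applied to the variable from the question, the suggested cast chain might look like the sketch below. It assumes the binary really holds single-byte varchar data in the database's default code page, as in the question's test code; the column alias is made up.

```sql
-- @foo mirrors the test variable from the question.
DECLARE @foo varbinary(max);
SET @foo = CAST('<Test>®</Test>' AS varbinary(max));

-- Reinterpret the bytes as varchar, widen to nvarchar, then parse as XML.
SELECT CAST(CAST(CAST(@foo AS varchar(max)) AS nvarchar(max)) AS xml) AS BinaryAsXml;
```

The explicit varchar step tells the engine which encoding the bytes use, instead of leaving it to guess from the raw binary.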

How to pass multiple values in a single parameter for a stored procedures?

I want to pass multiple values in a single parameter. SQL Server 2005
You can have your sproc take an xml typed input variable, then unpack the elements and grab them. For example:
DECLARE @XMLData xml
DECLARE
@Code varchar(10),
@Description varchar(100)
SET @XMLData =
'
<SomeCollection>
<SomeItem>
<Code>ABCD1234</Code>
<Description>Widget</Description>
</SomeItem>
</SomeCollection>
'
SELECT
@Code = SomeItems.SomeItem.value('Code[1]', 'varchar(10)'),
@Description = SomeItems.SomeItem.value('Description[1]', 'varchar(100)')
FROM @XMLData.nodes('//SomeItem') SomeItems (SomeItem)
SELECT @Code AS Code, @Description AS Description
Result:
Code Description
========== ===========
ABCD1234 Widget
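Since the goal is to pass several values, the same technique can return one row per element rather than assigning to scalar variables. The following is a sketch under that assumption; the second SomeItem element is invented sample data, and DECLARE and SET are kept separate so that it also runs on SQL Server 2005.

```sql
DECLARE @XMLData xml
SET @XMLData = '
<SomeCollection>
  <SomeItem><Code>ABCD1234</Code><Description>Widget</Description></SomeItem>
  <SomeItem><Code>EFGH5678</Code><Description>Gadget</Description></SomeItem>
</SomeCollection>'

-- nodes() yields one row per SomeItem; value() extracts each child as a typed column.
SELECT Item.value('(Code/text())[1]', 'varchar(10)') AS Code,
       Item.value('(Description/text())[1]', 'varchar(100)') AS Description
FROM @XMLData.nodes('/SomeCollection/SomeItem') AS Items(Item)
```

The resulting rowset can be joined or used in an IN subquery like any other table expression.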
You can make a function:
ALTER FUNCTION [dbo].[CSVStringsToTable_fn] ( @array VARCHAR(8000) )
RETURNS @Table TABLE ( value VARCHAR(100) )
AS
BEGIN
DECLARE @separator_position INTEGER,
@array_value VARCHAR(8000)
SET @array = @array + ','
WHILE PATINDEX('%,%', @array) <> 0
BEGIN
SELECT @separator_position = PATINDEX('%,%', @array)
SELECT @array_value = LEFT(@array, @separator_position - 1)
INSERT @Table
VALUES ( @array_value )
SELECT @array = STUFF(@array, 1, @separator_position, '')
END
RETURN
END
and select from it:
DECLARE @LocationList VARCHAR(1000)
SET @LocationList = '1,32'
SELECT Locations
FROM table
WHERE LocationID IN ( SELECT CAST(value AS INT)
FROM dbo.CSVStringsToTable_fn(@LocationList) )
OR
SELECT Locations
FROM table loc
INNER JOIN dbo.CSVStringsToTable_fn(@LocationList) list
ON CAST(list.value AS INT) = loc.LocationID
Which is extremely helpful when you attempt to send a multi-value list from SSRS to a PROC.
Edited: to show that you may need to CAST - However be careful to control what is sent in the CSV list
Just a suggestion: you can't really do this in SQL Server 2005. At least there is no straightforward way. You have to use CSV or XML or Base 64 or JSON. However, I strongly discourage you from doing so, since all of them are error prone and generate really big problems.
If you are able to switch to SQL Server 2008, you can use table-valued parameters (Reference1, Reference2).
If you cannot, I'd suggest you consider whether it needs to be done in a stored procedure at all, i.e. do you really want to (should/must) perform the SQL action using an SP? If you are just solving a problem, use an ad hoc query. If you are doing this for educational purposes, I would not even try the above-mentioned things.
There are multiple ways you can achieve this, by:
Passing CSV list of strings as an argument to a (N)VARCHAR parameter, then parsing it inside your SP, check here.
Create a XML string first of all, then pass it as an XML datatype param. You will need to parse the XML inside the SP, you may need APPLY operator for this, check here.
Create a temp table outside the SP, insert the multiple values as multiple rows, no param needed here. Then inside the SP use the temp table, check here.
If you are in 2008 and above try TVPs (Table Valued Parameters) and pass them as params, check here.
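For the table-valued parameter route (SQL Server 2008 and later), a minimal sketch might look like this. The type name dbo.LocationIdList and the procedure name dbo.GetLocationIds are hypothetical; a real procedure would join the parameter against its own tables instead of just echoing the values.

```sql
-- A user-defined table type describes the shape of the list being passed.
CREATE TYPE dbo.LocationIdList AS TABLE (LocationID int NOT NULL PRIMARY KEY);
GO
-- Table-valued parameters must be declared READONLY.
CREATE PROCEDURE dbo.GetLocationIds
    @LocationIds dbo.LocationIdList READONLY
AS
BEGIN
    SELECT LocationID
    FROM @LocationIds;
END
GO
-- The caller fills a variable of the table type and passes it as one parameter.
DECLARE @ids dbo.LocationIdList;
INSERT INTO @ids (LocationID) VALUES (1), (32);
EXEC dbo.GetLocationIds @LocationIds = @ids;
```

Compared with CSV or XML strings, the values arrive already typed, so no parsing or casting is needed inside the procedure.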

Store such characters in SQL Server 2008 R2

I'm storing encrypted passwords in the database. It has worked perfectly so far on MachineA. Now that I have moved to MachineB, it seems the results get corrupted in the table.
For example: ù9qÆæ\2 Ý-³Å¼]ó will change to ?9q??\2 ?-³?¼]? in the table.
That's the query I use:
ALTER PROC [Employees].[pRegister](@UserName NVARCHAR(50),@Password VARCHAR(150))
AS
BEGIN
DECLARE @Id UNIQUEIDENTIFIER
SET @Id = NEWID()
SET @password = HashBytes('MD5', @password + CONVERT(VARCHAR(50),@Id))
SELECT @Password
INSERT INTO Employees.Registry (Id,[Name],[Password]) VALUES (@Id, @UserName,@Password)
END
Collation: SQL_Latin1_General_CP1_CI_AS
ProductVersion: 10.50.1600.1
Thanks
You are mixing 2 datatypes:
password needs to be nvarchar to support non-Western European characters
literals need the N prefix
Demo:
DECLARE @pwdgood nvarchar(150), @pwdbad varchar(150)
SET @pwdgood = N'ù9qÆæ\2 Ý-³Å¼]ó'
SET @pwdbad = N'?9q??\2 ?-³?¼]?'
SELECT @pwdgood, @pwdbad
HashBytes gives varbinary(8000) so you need this in the table
Note: I'd also consider salting the stored password with something other than ID column for that row
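The point about HashBytes returning binary data can be sketched as follows. This is only an illustration with made-up variable names and a sample password; it keeps the hash in a varbinary variable instead of pushing the raw bytes through a character type, which is where the code page conversion can turn bytes into question marks.

```sql
DECLARE @Id uniqueidentifier;
DECLARE @Password nvarchar(150);
DECLARE @PasswordHash varbinary(16);   -- an MD5 hash is 16 bytes long

SET @Id = NEWID();
SET @Password = N'secret';             -- sample value only
SET @PasswordHash = HASHBYTES('MD5', @Password + CONVERT(nvarchar(36), @Id));

-- The hash stays binary, so it does not depend on any collation or code page.
SELECT @PasswordHash AS PasswordHash,
       DATALENGTH(@PasswordHash) AS HashLengthInBytes;
```

The corresponding table column would then also need a binary type rather than varchar or nvarchar.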
If you want to store such characters, you need to:
use NVARCHAR as the datatype for your columns and parameters (@Password isn't NVARCHAR and the CAST you're using to assign the password in the database table isn't using NVARCHAR either, in your sample ...)
use the N'....' syntax for indicating Unicode string literals
With those two in place, you should absolutely be able to store and retrieve any valid Unicode character

How to convert varbinary to GUID in TSQL stored procedure?

how can I convert the HASHBYTES return value to a GUID?
This is what I have so far.
CREATE PROCEDURE [dbo].[Login]
@email nvarchar,
@password varchar
AS
BEGIN
DECLARE @passHashBinary varbinary;
DECLARE @newPassHashBinary varbinary;
-- Create a unicode (utf-16) password
Declare @unicodePassword nvarchar;
Set @unicodePassword = CAST(@password as nvarchar);
SET @passHashBinary = HASHBYTES('md5', @password);
SET @newPassHashBinary = HASHBYTES('md5', @unicodePassword);
Simply cast it:
select cast(hashbytes('md5','foo') as uniqueidentifier)
But there are two questions lingering:
why cast HASHBYTES to guid? Why not use the appropriate type for storage, namely BINARY(16)
I hope you are aware that MD5 hashing passwords is basically useless, right? Because of rainbow tables. You need to use a secure hashing scheme, like an HMAC or the HA1 of Digest.
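To make the first point concrete, here is a small sketch (variable name chosen for illustration) that keeps the hash in a BINARY(16) variable and shows that the cast to uniqueidentifier and back is only a reinterpretation of the same 16 bytes.

```sql
DECLARE @hash binary(16);
SET @hash = HASHBYTES('md5', 'foo');

SELECT @hash AS AsBinary,
       CAST(@hash AS uniqueidentifier) AS AsGuid,
       CAST(CAST(@hash AS uniqueidentifier) AS binary(16)) AS BackToBinary;
```

The GUID text representation orders some of its bytes differently from the hexadecimal binary display, which is one more reason to store the hash as BINARY(16) if it will ever be compared with hashes computed elsewhere.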

How can I use sp_xml_preparedocument on result of NTEXT query in SQL 2000?

I know NTEXT is going away and that there are larger best-practices issues here (like storing XML in an NTEXT column), but I have a table containing XML from which I need to pluck an attribute value. This should be easy to do using sp_xml_preparedocument, but it is made trickier by the fact that you cannot declare a local variable of type NTEXT, and I cannot figure out how to use an expression to specify the XML text passed to the function. I can do it like this in SQL 2005 because of the XML or VARCHAR(MAX) datatypes, but what can I do for SQL 2000?
DECLARE @XmlHandle int
DECLARE @ProfileXml xml
SELECT @ProfileXml = ProfileXml FROM ImportProfile WHERE ProfileId = 1
EXEC sp_xml_preparedocument @XmlHandle output, @ProfileXml
-- Pluck the Folder TemplateId out of the FldTemplateId XML attribute.
SELECT FolderTemplateId
FROM OPENXML( @XmlHandle, '/ImportProfile', 1)
WITH(
FolderTemplateId int '@FldTemplateId' )
EXEC sp_xml_removedocument @XmlHandle
The only thing I can come up with for SQL 2000 is to use varchar(8000). Is there really no way to use an expression like the following?
EXEC sp_xml_preparedocument @XmlHandle output, (SELECT ProfileXml FROM ImportProfile WHERE ProfileId = 1)
Great question.. but no solution
Thoughts:
You can't wrap the SELECT call in a UDF (to create a kind of dummy ntext local var)
You can't wrap the sp_xml_preparedocument call in a scalar UDF (to use in SELECT) because you can't call extended stored procs
You can't concatenate a call to run dynamically because you'll hit string limits and scope issues
Ditto a self call using OPENQUERY
textptr + READTEXT can't be added as a parameter to sp_xml_preparedocument
So why does sp_xml_preparedocument take ntext as a datatype?
