Detecting Unicode Text in SQL Server - sql-server

I am storing bodies of text in SQL Server.
Some bodies of text contain Unicode characters that will be lost when storing in a VARCHAR column within SQL Server.
As only a small portion of the text bodies stored will require an NVARCHAR column, I have decided to create 2 columns, one for VARCHAR text and the other for NVARCHAR text. This way I can save on space by only storing Unicode bodies of text in the NVARCHAR column and the rest in the VARCHAR column.
The question is: how do I detect if a body of text contains Unicode characters so that I can determine the best column to store it in?

You could either determine the 256 characters available in your collation's code page and inspect the string for any characters not in that set, or cast it to varchar and then compare it to the nvarchar original.
If you are using code page 1252, the first approach could be done with
DECLARE @String NVARCHAR(MAX) = N'൯'
SELECT CASE
WHEN @String LIKE '%[^' COLLATE Latin1_General_100_BIN + CHAR(0) + '-' + CHAR(255) + ']%'
THEN 'varchar not OK'
ELSE 'varchar OK'
END
and the second approach...
DECLARE @String NVARCHAR(MAX) = N'൯'
SELECT CASE
WHEN CAST(@String AS VARCHAR(MAX)) = @String
THEN 'varchar OK'
ELSE 'varchar not OK'
END
BTW: If you use row compression, you also get Unicode compression thrown in, which would largely negate the need for this.
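As a rough illustration outside SQL Server, the two checks described above (membership in the code page, and a lossy round trip compared against the original) can be sketched in Python. The function names are hypothetical, and Python's cp1252 codec is only a stand-in for SQL Server's own conversion, which may apply best-fit substitutions that Python does not.

```python
def fits_code_page(text: str, code_page: str = "cp1252") -> bool:
    """Approach 1: True if every character exists in the code page."""
    try:
        text.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


def survives_round_trip(text: str, code_page: str = "cp1252") -> bool:
    """Approach 2: convert lossily, convert back, compare with the original."""
    lossy = text.encode(code_page, errors="replace").decode(code_page)
    return lossy == text


# Sample strings: plain ASCII, Latin-1 range characters, and the
# Malayalam character used in the answer above (U+0D6F).
for sample in ["plain ASCII", "caf\u00e9 \u00ae", "\u0d6f"]:
    verdict = "varchar OK" if fits_code_page(sample) else "varchar not OK"
    print(repr(sample), verdict)
```

Both functions should agree for any input; the round-trip version is closer in spirit to the CAST comparison, while the first mirrors the pattern-based check.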

Related

Select more than 65,536 characters from nvarchar(max) in SQL Server [duplicate]

Software I'm working with uses a text field to store XML. From my searches online, the text datatype is supposed to hold 2^31 - 1 characters. Currently SQL Server is truncating the XML at 65,535 characters every time. I know this is caused by SQL Server, because if I add a 65,536th character to the column directly in Management Studio, it states that it will not update because characters will be truncated.
Is the max length really 65,535 or could this be because the database was designed in an earlier version of SQL Server (2000) and it's using the legacy text datatype instead of 2005's?
If this is the case, will altering the datatype to Text in SQL Server 2005 fix this issue?
That is a limitation of SSMS, not of the text field, but you should use varchar(max) since text is deprecated.
Here is also a quick test
create table TestLen (bla text)
insert TestLen values (replicate(convert(varchar(max),'a'), 100000))
select datalength(bla)
from TestLen
Returns 100000 for me
MSSQL 2000 should allow up to 2^31 - 1 characters (non unicode) in a text field, which is over 2 billion. Don't know what's causing this limitation but you might wanna try using varchar(max) or nvarchar(max). These store as many characters but allow also the regular string T-SQL functions (like LEN, SUBSTRING, REPLACE, RTRIM,...).
If you're able to convert the column, you might as well, since the text data type will be removed in a future version of SQL Server. See here.
The recommendation is to use varchar(MAX) or nvarchar(MAX). In your case, you could also use the XML data type, but that may tie you to certain database engines (if that's a consideration).
You should have a look at:
XML Support in Microsoft SQL Server 2005
Beginning SQL Server 2005 XML Programming
So I would rather try to use the data type appropriate for the use, not make a datatype from a previous version fit your use.
Here's a little script I wrote for getting out all the data
DECLARE @data NVARCHAR (MAX) = N'huge data';
DECLARE @readSentence NVARCHAR (MAX) = N'';
DECLARE @dataLength INT = ( SELECT LEN (@data));
DECLARE @currIndex INT = 1; -- SUBSTRING positions are 1-based
WHILE @data <> @readSentence
BEGIN
-- Read the next chunk of up to 65535 characters and output it
DECLARE @temp NVARCHAR (MAX) = N'';
SET @temp = ( SELECT SUBSTRING (@data, @currIndex, 65535));
SELECT @temp;
SET @readSentence += @temp;
SET @currIndex += 65535;
END;

Convert UTF-8 varbinary(max) to varchar(max)

I have a varbinary(max) column with UTF-8-encoded text that has been compressed. I would like to decompress this data and work with it in T-SQL as a varchar(max) using the UTF-8 capabilities of SQL Server.
I'm looking for a way of specifying the encoding when converting from varbinary(max) to varchar(max). The only way I've managed to do that is by creating a table variable with a column with a UTF-8 collation and inserting the varbinary data into it.
DECLARE @rv TABLE(
Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO @rv
SELECT SUBSTRING(Decompressed, 4, DATALENGTH(Decompressed) - 3) WithoutBOM
FROM
(SELECT DECOMPRESS(RawResource) AS Decompressed FROM Resource) t
I'm wondering if there is a more elegant and efficient approach that does not involve inserting into a table variable.
UPDATE:
Boiling this down to a simple example that doesn't deal with byte order marks or compression:
I have the string "Hello 😊" UTF-8 encoded without a BOM stored in variable @utf8Binary
DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A
Now I try to assign that into various char-based variables and print the result:
DECLARE @brokenVarChar varchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenVarChar = ' + @brokenVarChar
DECLARE @brokenNVarChar nvarchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenNVarChar = ' + @brokenNVarChar
DECLARE @rv TABLE(
Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO @rv
select @utf8Binary
DECLARE @working nvarchar(max)
Select TOP 1 @working = Res from @rv
print '@working = ' + @working
The results of this are:
@brokenVarChar = Hello ðŸ˜Š
@brokenNVarChar = Hello ðŸ˜Š
@working = Hello 😊
So I am able to get the binary result properly decoded using this indirect method, but I am wondering if there is a more straightforward (and likely efficient) approach.
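To make the difference between the broken and the working conversions concrete, here is a small Python sketch (not T-SQL) that decodes the same ten bytes two ways. It assumes the database default code page is Windows-1252, which the question does not state; the variable names are illustrative only.

```python
# The same bytes as the varbinary literal 0x48656C6C6F20F09F988A.
utf8_binary = bytes.fromhex("48656C6C6F20F09F988A")

# Interpreting the bytes as UTF-8 yields the intended text.
as_utf8 = utf8_binary.decode("utf-8")

# Interpreting each byte as a Windows-1252 character (assumed to be what a
# plain CONVERT(varchar(max), ...) does under a 1252 collation) yields
# four separate characters in place of the single emoji.
as_cp1252 = utf8_binary.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

The bytes never change; only the encoding used to interpret them does, which is why a UTF-8 collation has to be involved somewhere in the conversion.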
I don't like this solution, but it's one I got to (I initially thought it wasn't working, due to what appears to be a bug in ADS). One method would be to create a new database in a UTF8 collation, and then pass the value to a function in that database. As the database is in a UTF8 collation, the default collation will be different to the local one, and the correct result will be returned:
CREATE DATABASE UTF8 COLLATE Latin1_General_100_CI_AS_SC_UTF8;
GO
USE UTF8;
GO
CREATE OR ALTER FUNCTION dbo.Bin2UTF8 (@utfbinary varbinary(MAX))
RETURNS varchar(MAX) AS
BEGIN
RETURN CAST(@utfbinary AS varchar(MAX));
END
GO
USE YourDatabase;
GO
SELECT UTF8.dbo.Bin2UTF8(0x48656C6C6F20F09F988A);
This, however, isn't particularly "pretty".
There is an undocumented hack:
DECLARE @utf8 VARBINARY(MAX)=0x48656C6C6F20F09F988A;
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utf8,']]>') AS XML)
.value('.','nvarchar(max)');
The result
Hello 😊
This works even in versions without the new UTF8 collations...
UPDATE: calling this as a function
This can easily be wrapped in a scalar function
CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
RETURN
(
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
.value('.','nvarchar(max)')
);
END
GO
Or like this as an inlined table valued function
CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS TABLE
AS
RETURN
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
.value('.','nvarchar(max)') AS ConvertedString
GO
This can be used after FROM or - more appropriate - with APPLY
DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A;
DECLARE @brokenNVarChar nvarchar(max) = concat(@utf8Binary, '' COLLATE Latin1_General_100_CI_AS_SC_UTF8);
print '@brokenNVarChar = ' + @brokenNVarChar;
You didn't say how your data is compressed or what compression algorithm was used. But if you are using the COMPRESS function in SQL Server 2016 or later, you can use the DECOMPRESS function and then cast it to a VARCHAR(MAX). Both COMPRESS and DECOMPRESS use the GZip compression algorithm.
This function will decompress an input expression value, using the GZIP algorithm. DECOMPRESS will return a byte array (VARBINARY(MAX) type).
CAST(DECOMPRESS([compressed content here]) AS VARCHAR(MAX))
See: COMPRESS (Transact-SQL) and DECOMPRESS (Transact-SQL)

How to store a string along with Syllabication in varchar column

Is there any way to store āre exactly in a SQL Server table?
I hardcoded the same value in a varchar column. It is saving "are". I want to store it along with the special symbols.
Use Nvarchar - Nvarchar stores Unicode data. If you have requirements to store Unicode or multilingual data, Nvarchar is the choice. You need an N prefix when inserting data. Varchar stores non-Unicode data, limited to the code page of the column's collation.
Refer below sample code
declare @data table
(field1 nvarchar(10))
insert into @data
values
(N'āre')
select * from @data
You need to declare your string assignment using the N prefix (the N stands for "National Character"), as you need to explicitly say you are passing a string containing Unicode characters here (or an nchar, ntext etc. if you were using those).
NVarchar literals are denoted by the N' prefix, so it would be
DECLARE @objname nvarchar(255)
set @objname=N'漢字'
select @objname
Now the output will be 漢字 as it has been set. Run above code.

TSQL "Illegal XML Character" When Converting Varbinary to XML

I'm trying to create a stored procedure in SQL Server 2016 that converts XML that was previously converted into Varbinary back into XML, but getting an "Illegal XML character" error when converting. I've found a workaround that seems to work, but I can't actually figure out why it works, which makes me uncomfortable.
The stored procedure takes data that was converted to binary in SSIS and inserted into a varbinary(MAX) column in a table and performs a simple
CAST(Column AS XML)
It worked fine for a long time, and I only began seeing an issue when the initial XML started containing an ® (registered trademark) symbol.
Now, when I attempt to convert the binary to XML I get this error
Msg 9420, Level 16, State 1, Line 23
XML parsing: line 1, character 7, illegal xml character
However, if I first convert the binary to varchar(MAX), then convert that to XML, it seems to work fine. I don't understand what is happening when I perform that intermediate CAST that is different than casting directly to XML. My main concern is that I don't want to add it in to account for this scenario and end up with unintended consequences.
Test code:
DECLARE @foo VARBINARY(MAX)
DECLARE @bar VARCHAR(MAX)
DECLARE @nbar NVARCHAR(MAX)
--SELECT Varbinary
SET @foo = CAST( '<Test>®</Test>' AS VARBINARY(MAX))
SELECT @foo AsBinary
--select as binary as varchar
SET @bar = CAST(@foo AS VARCHAR(MAX))
SELECT @bar BinaryAsVarchar -- Correct string output
--select binary as nvarchar
SET @nbar = CAST(@foo AS NVARCHAR(MAX))
SELECT @nbar BinaryAsNvarchar -- Chinese characters
--select binary as XML
SELECT TRY_CAST(@foo AS XML) BinaryAsXML -- ILLEGAL XML character
-- SELECT CONVERT(xml, @foo) BinaryAsXML --ILLEGAL XML Character
--select BinaryAsVarcharAsXML
SELECT TRY_CAST(@bar AS XML) BinaryAsVarcharAsXML -- Correct Output
--select BinaryAsNVarcharAsXML
SELECT TRY_CAST(@nbar AS XML) BinaryAsNvarcharAsXML -- Chinese Characters
There are several things to know:
SQL Server is rather limited with character encodings. There is VARCHAR, which is 1-byte-encoded extended ASCII, and NVARCHAR, which is UCS-2 (almost the same as UTF-16).
VARCHAR uses plain Latin for the first set of characters and a code page mapping provided by the collation in use for the second set.
VARCHAR is not UTF-8. UTF-8 works with VARCHAR as long as all characters are 1-byte-encoded. But UTF-8 encodes many characters in 2 to 4 bytes, which would break the internal storage of a VARCHAR string.
NVARCHAR will work with almost any 2-byte-encoded character natively (that means with almost any existing character). But it is not exactly UTF-16: characters outside the Basic Multilingual Plane are encoded as 4-byte surrogate pairs, which would break SQL Server's internal storage.
XML is not stored as the XML string you see, but as a hierarchically organised physical table, based on NVARCHAR values.
The natively stored XML is really fast, while any text-based storage will need a very expensive parse operation in advance (over and over...).
Storing XML as a string is bad; storing XML as a VARCHAR string is even worse.
Storing a VARCHAR-string XML as VARBINARY is an accumulation of things you should not do.
Try this:
DECLARE @text1Byte VARCHAR(100)='<test>blah</test>';
DECLARE @text2Byte NVARCHAR(100)=N'<test>blah</test>';
SELECT CAST(@text1Byte AS VARBINARY(MAX)) AS text1Byte_Binary
,CAST(@text2Byte AS VARBINARY(MAX)) AS text2Byte_Binary
,CAST(@text1Byte AS XML) AS text1Byte_XML
,CAST(@text2Byte AS XML) AS text2Byte_XML
,CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS XML) AS text1Byte_XML_via_Binary
,CAST(CAST(@text2Byte AS VARBINARY(MAX)) AS XML) AS text2Byte_XML_via_Binary
The only difference you'll see is the many zeros in 0x3C0074006500730074003E0062006C00610068003C002F0074006500730074003E00. This is due to the 2-byte encoding of nvarchar; every second byte is not needed in this sample. But if you needed Far Eastern characters, the picture would be completely different.
The reason why it works: SQL Server is very smart. The cast from the variable to XML is rather easy, as the engine knows that the underlying variable is varchar or nvarchar. But the last two casts are different. The engine has to examine the binary to see whether it is a valid nvarchar, and will give it a second try with varchar if that fails.
Now try to add your registered trademark to the given example. Add it first to the second variable, DECLARE @text2Byte NVARCHAR(100)=N'<test>blah®</test>';, and try to run this. Then add it to the first variable and try it again.
What you can try:
Cast your binary to varchar(max), then to nvarchar(max) and finally to xml.
,CAST(CAST(CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS VARCHAR(MAX)) AS NVARCHAR(MAX)) AS XML) AS text1Byte_XML_via_Binary
This will work, but it won't be fast...

Stored procedure Inserts Hebrew characters into an NVARCHAR column, but SELECT shows "?"

When I SELECT from the table, the data that I stored is shown as question marks.
@word is a parameter in my stored procedure, and the value comes from the C# code:
string word = this.Request.Form["word"].ToString();
cmd.Parameters.Add("@word", System.Data.SqlDbType.NVarChar).Value = word;
My stored procedure is like this:
CREATE PROCEDURE ....
(
@word nvarchar(500)
...
)
Insert into rub_translate (language_id,name)
values (8 ,@word COLLATE HEBREW_CI_AS )
My database, and the column, is using the SQL_Latin1_General_CP1_CI_AS collation and I cannot change them.
Can anybody give me a solution how can I solve this problem just by modifying the column or the table?
In order for this to work you need to do the following:
Declare the input parameter in the app code as NVARCHAR (you have done this)
Declare the input parameter in the stored procedure as NVARCHAR (no code is shown for this)
Insert or Update a column in a table that is defined as NVARCHAR (you claim that this is the case)
When using NVARCHAR it does not matter what the default Collation of the Database is. And actually, when using NVARCHAR, it won't matter what the Collation of the column in the table is, at least not for properly inserting the characters.
Also, specifying the COLLATE keyword in the INSERT statement is unnecessary and wouldn't help anyway. If you have the stored procedure input parameter defined as VARCHAR, then the characters are already converted to ? upon coming into the stored procedure. And if the column is actually defined as VARCHAR (you haven't provided the table's DDL) then if the Collation isn't Hebrew_* then there is nothing you can do (besides change either the datatype to NVARCHAR or the Collation to a Hebrew_ one).
If those three items listed at the top are definitely in place, then the last thing to check is the input value itself. Since this is a web app, it is possible that the encoding of the page itself is not set correctly. You need to set a break point just at the cmd.Parameters.Add("@word", System.Data.SqlDbType.NVarChar).Value = word; line and confirm that the value held in the word variable contains Hebrew characters instead of ?s.
ALSO: you should never create a string parameter without specifying the max length/size. The default is 30 (in this case, sometimes it's 1), yet the parameter in the stored procedure is defined as NVARCHAR(500). This could result in silent truncation ("silent" meaning that it will not cause an error, it will just truncate the value). Instead, you should always specify the size. For example:
cmd.Parameters.Add("@word", System.Data.SqlDbType.NVarChar, 500).Value = word;
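The point above about characters turning into ? once they pass through a non-Hebrew VARCHAR can be illustrated with a small Python sketch. It uses cp1252 as a stand-in for the SQL_Latin1_General_CP1 code page and cp1255 for a Hebrew one; both mappings are assumptions made for illustration, and Python's codecs only approximate SQL Server's conversion.

```python
word = "\u05d6\u05d4 \u05d8\u05e7\u05e1\u05d8."  # the Hebrew sample text used in this thread

# A VARCHAR with a Latin code page has no slot for Hebrew letters,
# so each one is replaced by a question mark.
as_latin_varchar = word.encode("cp1252", errors="replace").decode("cp1252")

# A VARCHAR with a Hebrew code page keeps the text intact.
as_hebrew_varchar = word.encode("cp1255").decode("cp1255")

# NVARCHAR (UTF-16) keeps it intact regardless of collation.
as_nvarchar = word.encode("utf-16-le").decode("utf-16-le")

print(as_latin_varchar)
print(as_hebrew_varchar == word, as_nvarchar == word)
```

Once the replacement has happened the original characters cannot be recovered, which is why every step from the application parameter to the column has to be NVARCHAR.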
You could just insert it as-is, since it's unicode and then select it with a proper collation:
declare @test table([name] nvarchar(500) collate Latin1_General_CI_AS);
declare @word nvarchar(500) = N'זה טקסט.';
insert into @test ( [name] ) values ( @word );
select [t].[name] collate Hebrew_CI_AS from @test as [t]
Or you can change the collation of that one column in the table altogether. But remember that there is a drawback to having a different collation from your database in one or more columns: you will need to add the COLLATE clause to queries when you need to compare data between different collations.

Resources