Select Hex/Char conversion - sql-server

I have some data in a SQL database stored in the format below, which I would like to convert to a readable string:
540045005300540049004E00470031003200330034
I would like to run some kind of SELECT statement to return the text which should be TESTING1234
It appears to be in Hex format separated by 00 between each character, so if I run these statements:
SELECT CHAR(0x54)
SELECT CHAR(0x45)
This returns:
T
E
Is there any way I can convert the whole string in one statement?
Thanks!

The 00 bytes point to a 2-byte encoding, which is represented as NVARCHAR. Try this:
SELECT CAST(0x540045005300540049004E00470031003200330034 AS NVARCHAR(MAX))
Or directly from the HEX-string as string:
SELECT CAST(CONVERT(VARBINARY(MAX),'540045005300540049004E00470031003200330034',2) AS NVARCHAR(MAX));
The result is TESTING1234
Some more background on string encoding
SQL Server knows exactly two types of strings:
1-byte-encoded VARCHAR / CHAR
2-byte-encoded NVARCHAR / NCHAR
The 1-byte string is extended ASCII; the related collation provides a code page to map non-plain-Latin characters (it is not UTF-8, as people sometimes claim).
The 2-byte string is UCS-2 (almost the same as UTF-16).
I've corrected the word "Unicode" above, as it was not strictly accurate.
There are many encodings SQL Server will not be able to interpret natively.
The string above looks like it is good for NVARCHAR, but this is not guaranteed in every case.
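Which of the two string types you cast a binary to matters. As a small sketch, casting the bytes from the question to each type shows the difference (column aliases are illustrative):

```sql
-- Cast as a 2-byte string (NVARCHAR): each 2-byte pair is one character
SELECT CAST(0x540045005300540049004E00470031003200330034 AS NVARCHAR(MAX)) AS TwoByte; -- TESTING1234
-- The same bytes cast as a 1-byte string (VARCHAR): every second byte becomes
-- an embedded 0x00 (NUL) character, so the result is unusable as text
SELECT CAST(0x540045005300540049004E00470031003200330034 AS VARCHAR(MAX)) AS OneByte;
```

How the second result displays depends on the client, but the embedded NUL bytes are there either way.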
Some more background on binary encoding
SQL Server knows BINARY and VARBINARY as real BLOB types. In the result of a SELECT they are presented as a HEX string, and in a script you can use a HEX string as native input. But it is important to know that this HEX string is not the actual value, just the human-readable representation on a computer screen.
And there can be a real string which looks like a HEX string (but isn't):
0x123 != '0x123'
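The distinction is easy to see with DATALENGTH (a small sketch):

```sql
SELECT DATALENGTH(0x1234)   AS RealBinary;  -- 2: two bytes of binary data
SELECT DATALENGTH('0x1234') AS JustAString; -- 6: six ordinary characters
```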
If you have a string, which is a HEX-string, but is coming to you as "normal" string (e.g. in a text based container like a CSV file or an XML) you have to convert this.
And, not really related to this question, just to mention it: There are more string based binary representers like base64.
Some examples
--start with a normal string
DECLARE @str VARCHAR(100)='This is a test to explain conversions from string to hex to binary and back';
--see the HEX string (real binary!)
SELECT CAST(@str AS VARBINARY(MAX)) ThisIsTheHexStringOfTheString;
--I copy the binary behind the "=" _without_ quotes
DECLARE @ThisIsTheBinary VARBINARY(MAX)=0x546869732069732061207465737420746F206578706C61696E20636F6E76657273696F6E732066726F6D20737472696E6720746F2068657820746F2062696E61727920616E64206261636B;
--This can be re-cast directly
SELECT CAST(@ThisIsTheBinary AS VARCHAR(MAX)) ThisIsReconvertedBinary;
--there is an undocumented function providing a HEX string from a binary
DECLARE @aHEXstring VARCHAR(MAX)=sys.fn_varbintohexstr(CAST(@str AS VARBINARY(MAX)));
--This string looks exactly the same as above, but it is a string
SELECT @aHEXstring AS ThisIsStringWhichLooksLikeHEX;
--You can use dynamic SQL
EXEC('SELECT CAST(' + @aHEXstring + ' AS VARCHAR(MAX)) AS CastedViaDynamicSQL');
--or CONVERT's abilities (read the documentation!)
SELECT CAST(CONVERT(VARBINARY(MAX),@aHEXstring,1) AS VARCHAR(MAX)) AS ConvertedViaCONVERT

Related

SQL string to varbinary through XML different than nvarchar

We currently have a function in SQL which I simply do not understand.
Currently we convert an nvarchar to XML, then select the XML value and convert that to a varbinary.
When I try to simplify this by converting the nvarchar directly to varbinary, the output is different... Why?
--- Current situation:
DECLARE @inputString nvarchar(max) = '4d95605d1b8f3bca5ea3e0d2af26027004d17218152e726da0622d669a71f85c'
--1: input to XML
DECLARE @inputXML XML = convert(varchar(max), @inputString)
--2: input XML to binary
DECLARE @inputBinary varbinary(max) = @inputXML.value('(/)[1]', 'varbinary(max)')
select @inputString -- 4d95605d1b8f3bca5ea3e0d2af26027004d17218152e726da0622d669a71f85c
select @inputXML -- 4d95605d1b8f3bca5ea3e0d2af26027004d17218152e726da0622d669a71f85c
select @inputBinary -- 0xE1DF79EB4E5DD5BF1FDDB71AE5E6B77B477669FDBAD36EF4D38775EF6D7CD79D9EEF6E9D6B4EB6D9DEBAF5AEF57FCE5C
--- New situation
--1: Input to binary
DECLARE @inputString2 varbinary(max) = CAST(@inputString as varbinary(max));
select @inputString2 -- 0x3400640039003500360030003500640031006200380066003300620063006100350065006100330065003000640032006100660032003600300032003700300030003400640031003700320031003800310035003200650037003200360064006100300036003200320064003600360039006100370031006600380035006300
Using the value() function to get a XML value specified as varbinary(max) will read the data as if it was Base64 encoded. Casting a string to varbinary(max) does not, it treats it as just any string.
If you use the input string QQA= which is the letter A in UTF-16 LE encoded to Base64 you will see more clearly what is happening.
XML gives you 0x4100, the varbinary of the letter A. The direct cast on the string gives you 0x5100510041003D00, where you have two 5100 = "Q", one 4100 = "A", and finally a 3D00 = "=".
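This can be reproduced directly (a small sketch; the variable names are illustrative):

```sql
DECLARE @s NVARCHAR(10) = N'QQA=';
-- Reading the value through the XML type decodes it as Base64
DECLARE @x XML = CONVERT(VARCHAR(10), @s);
SELECT @x.value('(/)[1]', 'varbinary(max)') AS ViaXmlBase64; -- 0x4100
-- A direct cast just reinterprets the UTF-16 LE bytes of the string itself
SELECT CAST(@s AS VARBINARY(MAX)) AS DirectCast;             -- 0x5100510041003D00
```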
I might be getting something wrong, but - if I understand you correctly - I think you simply want to get a real binary from a HEX string which just looks like a binary. Correct?
Above I wrote "simply", but this was not simple at all a while ago.
I'm not sure at the moment, but I think it was SQL Server 2012 that enhanced CONVERT() (read about binary values and how the third parameter works). Try this:
DECLARE @hexString VARCHAR(max)='4d95605d1b8f3bca5ea3e0d2af26027004d17218152e726da0622d669a71f85c';
SELECT CONVERT(varbinary(max),@hexString,2);
The result is a real binary
0x4D95605D1B8F3BCA5EA3E0D2AF26027004D17218152E726DA0622D669A71F85C
What might be the reason for your issue:
Very long ago - I think until SQL Server 2005 - the default encoding of varbinaries in XML was a HEX string. Later this was changed to base64. It might be that your code was written for a very old environment and was later upgraded to a newer version?
Today we use XML in a similar way to create and to read base64, which is not supported otherwise. Maybe your code did something similar with HEX strings?
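That base64 round trip via the XML type looks roughly like this (a sketch: FOR XML with the BINARY BASE64 option encodes, and xs:base64Binary() in an XQuery expression decodes; the values are illustrative):

```sql
DECLARE @bin VARBINARY(MAX) = CAST('Hello' AS VARBINARY(MAX)); -- 0x48656C6C6F
-- binary -> base64 string
DECLARE @b64 VARCHAR(MAX) = (SELECT @bin FOR XML PATH(''), BINARY BASE64);
SELECT @b64 AS Base64String; -- SGVsbG8=
-- base64 string -> binary
SELECT CAST('' AS XML).value('xs:base64Binary(sql:variable("@b64"))', 'VARBINARY(MAX)') AS BackToBinary;
```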
One more hint: the many 00 bytes in your New Situation example show clearly that this is a two-byte-encoded NVARCHAR string. By contrast, your Current Situation shows a simple HEX string.
Your final result is just the binary pattern of your input as a string.

How to remove weird Excel character in SQL Server?

There is a weird whitespace character I can't seem to get rid of that occasionally shows up in my data when importing from Excel. Visibly, it comes across as a whitespace character BUT SQL Server sees it as a question mark (ASCII 63).
declare @temp nvarchar(255); set @temp = 'carolg@c?am.com'
select @temp
returns:
?carolg@c?am.com
How can I get rid of the whitespace without getting rid of real question marks? If I look at the ASCII code for each of those "?" characters I get 63 when, in fact, only one of them is a real question mark.
Have a look at this answer for someone with a similar issue. Sorry if this is a bit long winded:
SQL Server seems to flatten Unicode to ASCII by mapping unrepresentable characters (for which there is no suitable substitution) to a question mark. To replicate this, open the Character Map Windows program (installed on most machines), select Arial as the font and find U+034F "Combining Grapheme Joiner". Select this character, copy it to the clipboard and paste it between the single quotes below:
declare @t nvarchar(10)
set @t = '͏'
select rtrim(ltrim(@t)) -- we can try to trim it, but by this stage it's already a '?'
You'll get a question mark out, because it doesn't know how to represent this non-ASCII character when it casts it to varchar. To force it to accept it as a double-byte character (nvarchar) you need to use N'' instead, as has already been mentioned. Add an N before the quotes above and the question mark disappears (but the original invisible character is preserved in the output - and ltrim and rtrim won't remove it as demonstrated below):
declare @t nvarchar(10),
        @s varchar(10) -- note: single-byte string
set @t = rtrim(ltrim(N'͏')) -- trimming doesn't work here either
set @s = @t
select @s -- still outputs a question mark
Imported data can definitely do this, I've seen it before, and characters like the one I've shown above are particularly hard to diagnose because you can't see them! You will need to create some sort of scrubbing process to remove these unprintables (and any other junk characters, for that matter), and make sure that you use nvarchar everywhere, or you'll end up with this issue. Worse, those phantom question marks will become real question marks that you won't be able to distinguish from legitimate ones.
To see what character code you're dealing with, you can cast as varbinary as follows:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- returns 0x4F0374006500730074003F00
-- Returns:
-- 0x4F03 7400 6500 7300 7400 3F00
-- badchar t e s t ?
Now to get rid of it:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- bad char
set @t = replace(@t COLLATE Latin1_General_100_BIN2, nchar(0x034f), N'');
select cast(@t as varbinary) -- gone!
Note I had to swap the byte order from 0x4f03 to 0x034f (same reason "t" appears in the output as 0x7400, not 0x0074). For some notes on why we're using binary collation, see this answer.
This is kind of messy, because you don't know what the dirty characters are, and they could be one of thousands of possibilities. One option is to iterate over strings using like or even the unicode() function and discard characters in strings that aren't in a list of acceptable characters, but this could be slow. It may be that most of your bad characters are either at the start or end of the string, which might speed this process up if that's an assumption you think you can make.
You may need to build additional processes either external to SQL Server or as part of a SSIS import based on what I've shown you above to strip this out quickly if you have a lot of data to import. If you aren't sure the best way to do this, that's probably best answered in a new question.
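A minimal scrubbing sketch might look like the following. Note that dbo.StripUnprintables is a hypothetical helper, keeping only printable ASCII is far stricter than most real data allows, and a character-by-character loop will be slow on large tables; it only illustrates the idea.

```sql
CREATE FUNCTION dbo.StripUnprintables (@s NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @out NVARCHAR(MAX) = N'';
    DECLARE @i INT = 1;
    -- LEN ignores trailing spaces, so pad before counting
    WHILE @i <= LEN(@s + N'x') - 1
    BEGIN
        DECLARE @c NCHAR(1) = SUBSTRING(@s, @i, 1);
        -- keep only printable ASCII (U+0020 .. U+007E); widen the range as needed
        IF UNICODE(@c) BETWEEN 32 AND 126
            SET @out = @out + @c;
        SET @i = @i + 1;
    END;
    RETURN @out;
END;
```

Applied to the example string, SELECT dbo.StripUnprintables(N'͏test?') would drop the invisible U+034F while keeping the real question mark.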
I hope that helps.

LEN and DATALENGTH of VARCHAR and NVARCHAR

After reading "What is the difference between char, nchar, varchar, and nvarchar in SQL Server?" I have a question.
I'm using MS SQL Server 2008 R2
DECLARE @T TABLE
(
C1 VARCHAR(20) COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS,
C2 NVARCHAR(20) COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS
)
INSERT INTO @T VALUES (N'中华人民共和国',N'中华人民共和国')
SELECT LEN(C1) AS [LEN(C1)],
DATALENGTH(C1) AS [DATALENGTH(C1)],
LEN(C2) AS [LEN(C2)],
DATALENGTH(C2) AS [DATALENGTH(C2)]
FROM @T
Returns
LEN(C1) DATALENGTH(C1) LEN(C2) DATALENGTH(C2)
----------- -------------- ----------- --------------
7 12 7 14
Why is DATALENGTH(C1) 12?
In your INSERT you are converting the text from Unicode to the Chinese code page for C1. This process can alter the text, and some characters may be lost.
Here is SQL Fiddle.
You can see that the second character 华 is stored as 3F in varchar. You can also see that the last character 国 is also stored as 3F in varchar. 3F is the code for ?. When Windows converts text from Unicode to a code page and a certain character can't be represented in that code page, the conversion function (most likely WideCharToMultiByte) puts ? in place of such characters.
One more example. The last-but-one character 和 is encoded as A94D in varchar and 8C54 in nvarchar. If you look it up in Character Map it will show these codes (Unicode and code page):
See also:
What does it mean when my text is displayed as Question Marks?
https://www.microsoft.com/middleeast/msdn/Questionmark.aspx
Any time Unicode data must be displayed, they may be internally
converted from Unicode using the WideCharToMultiByte API. Any time a
character cannot be represented on the current code page, it will be
replaced by a question mark (?).
This is exactly what happens when you store a Unicode literal N'中华人民共和国' in a varchar column: the Unicode text is converted to multi-byte, and characters that can't be represented in that code page are replaced by question marks ?.
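The flattening can be reproduced in one statement (a sketch using the collation from the question; applying COLLATE to the nvarchar source forces the conversion through that collation's code page):

```sql
-- Five characters survive as 2-byte DBCS codes, two become 1-byte '?':
-- DATALENGTH is 5*2 + 2*1 = 12
SELECT CAST(N'中华人民共和国' COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS AS VARCHAR(20)) AS C1Equivalent,
       DATALENGTH(CAST(N'中华人民共和国' COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS AS VARCHAR(20))) AS Bytes;
```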

Why does casting a UTF-8 VARCHAR column to XML require converting to NVARCHAR and encoding change?

I am trying to convert data in a varchar column to XML, but I was getting errors with certain characters. Running this ...
-- This fails
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@Data AS XML) AS DataXml
... results in the following error:
Msg 9420, Level 16, State 1, Line 3
XML parsing: line 1, character 55, illegal xml character
It appears that it's the broken pipe character that is causing the error, but I thought that it was a valid character for UTF-8. Looking at the XML spec, it appears to be valid.
When I change it to this ...
-- This works
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(CAST(@Data AS NVARCHAR(MAX)), 'encoding="utf-8"', '') AS XML) AS DataXml
... it works without error (replacing the encoding string with utf-16 also works). I'm using SQL Server 2008 R2 with the SQL_Latin1_General_CP1_CI_AS collation.
Can anyone tell me why I need to convert to NVARCHAR and strip the encoding="utf-8" for this to work?
Thanks,
Edit
It appears that this also works ...
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(@Data, 'encoding="utf-8"', '') AS XML) AS DataXml
Removing the utf-8 encoding from the prolog is sufficient for SQL Server to do the conversion.
Remy's answer is, unfortunately, incorrect. VARCHAR absolutely does support Extended ASCII. Standard ASCII is only the first 128 values (0x00 - 0x7F). That happens to be the same for all code pages (i.e. 8-bit VARCHAR data) and UTF-16 (i.e. 16-bit NVARCHAR data) in SQL Server. Extended ASCII covers the remaining 128 of the 256 total values (0x80 - 0xFF). These 128 values / code points differ per code page, though there is a lot of overlap between some of them.
Remy states that VARCHAR does not support U+00A6 BROKEN BAR. This is easily disproven by simply adding SELECT @Data; after the first line:
DECLARE @Data VARCHAR(1000) =
'<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT @Data;
That returns:
<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>
The ¦ character is clearly supported, so the problem must be something else.
It appears that it's the broken pipe character that is causing the error but I thought that it was a valid character for UTF-8.
The broken pipe character is a valid character in UTF-8. The problem is: you aren't passing in UTF-8 data. Yes, you state that the encoding is UTF-8 in the xml declaration, but that doesn't mean that the data is UTF-8, it merely sets the expectation that it needs to be UTF-8.
You are converting a VARCHAR literal into XML. Your database's default collation is SQL_Latin1_General_CP1_CI_AS, which uses the Windows-1252 code page for VARCHAR data. This means that the broken vertical bar character has a value of 166, or 0xA6. Well, 0xA6 is not a valid UTF-8 encoding of anything. If you were truly passing in UTF-8 encoded data, then that broken vertical bar character would be two bytes: 0xC2 and then 0xA6. If we add that 0xC2 byte to the original input value (the 0xA6 is the same, so we can keep it where it is), we get:
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test'
+ CHAR(0xC2) + '¦</NewDataSet>';
SELECT @Data AS [@Data];
SELECT CAST(@Data AS XML) AS [DataXml];
and that returns:
<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>
followed by:
<NewDataSet>Test¦</NewDataSet>
This is why removing the encoding="utf-8" fixed the problem:
with it there, the bytes of that string needed to actually be UTF-8 but they weren't, and ...
with it removed, the encoding is assumed to be that of the string itself. For a VARCHAR string this is the code page associated with its collation, and a VARCHAR literal or variable uses the database's default collation. Meaning, in this context, either without the encoding="xxxxxx", or with encoding="Windows-1252", the bytes need to be encoded as Windows-1252, and indeed they are.
Putting this all together, we get:
If you have an actual UTF-8 encoded string, then it can be passed into the XML datatype, but you need to have:
no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
the XML declaration stating that the encoding is UTF-8
If you have a string encoded in the code page that is associated with the database's default collation, then you need to have:
no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
either no "encoding" as part of an <?xml ?> declaration, or have encoding set to the code page associated with the database's default collation (e.g. Windows-1252 for code page 1252)
If your string is already Unicode, then you need to:
prefix a string literal with an upper-case "N" or use an NVARCHAR variable or column for the incoming XML
have either no "encoding" as part of an <?xml ?> declaration, or have encoding set to "utf-16"
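As a compact recap of these combinations (a sketch; the first statement assumes a database whose default collation uses code page 1252, as in the question):

```sql
-- 1) Code-page string, no encoding declared: works
DECLARE @cp VARCHAR(100) = '<NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@cp AS XML) AS FromCodePage;

-- 2) Unicode string, encoding declared as utf-16 (or no declaration at all): works
DECLARE @uni NVARCHAR(100) = N'<?xml version="1.0" encoding="utf-16"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@uni AS XML) AS FromUnicode;
```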
Please see my answer to "Converting accented characters in varchar() to XML causing “illegal XML character”" for more details on this.
And, just to have it stated: while SQL Server 2019 introduced native support for UTF-8 in VARCHAR literals, variables, and columns, that has no impact on what is being discussed in this answer.
For info on collations, character encoding, etc, please visit: Collations Info
Your pipe character is using Unicode codepoint U+00A6 BROKEN BAR instead of U+007C VERTICAL LINE. U+00A6 is outside of ASCII. VARCHAR does not support non-ASCII characters. That is why you have to use NVARCHAR instead, which is designed to handle Unicode data.

text encodings in .net, sql server processing

I have an application that gets terms from a DB to use as a list of string terms. The DB table was set up with nvarchar for that column to include all foreign characters. Characters like ä come through clearly when getting the terms from the DB and even show that way in the table.
When importing Japanese or Arabic characters, all I see are ????????.
I have tried converting it using different methods: first converting it into UTF-8 encoding and back, and secondly using HttpUtility.HtmlEncode, which works perfectly for these characters but then also converts quotes and other things, which I don't need it to do.
I accused the DB designer of needing to do something on his part, but am I wrong in thinking that the DB should display all these characters and make it easy to just query them and add them to my search list? If not, is there a consistent way of getting all international characters to display correctly in SQL and VB.NET?
I know that when I read from text files I just used the Microsoft.VisualBasic.TextFieldParser reader tool with the encoding set to UTF-8 and this was not an issue.
If the database field is nvarchar, then it will store data correctly. As you have seen.
Somewhere before it gets to the database, the data is being lost or changed to varchar: stored procedure, parameters, file encoding, ODBC translation etc.
DECLARE @foo nvarchar(100), @foo2 varchar(100)
--with arabic and japanese and proper N literal
SELECT @foo = N'العربي 日本語', @foo2 = N'العربي 日本語'
SELECT @foo, @foo2 -- gives العربي 日本語
--now a varchar literal
SELECT @foo = 'العربي 日本語', @foo2 = 'العربي 日本語'
SELECT @foo, @foo2 --gives ?????? ???
--from my Swiss German keyboard. These are part of my code page.
SELECT @foo = 'öéäàüè', @foo2 = 'öéäàüè'
SELECT @foo, @foo2 --gives öéäàüè öéäàüè
So, apologise to the nice DB monkey... :-)
Always try to use NVARCHAR or NTEXT to store foreign characters.
You cannot store Unicode in the varchar or text datatypes.
Also, put an N before the string value,
like
UPDATE [USER]
SET Name = N'日本語'
WHERE ID = XXXX;
