LEN and DATALENGTH of VARCHAR and NVARCHAR - sql-server

After reading "What is the difference between char, nchar, varchar, and nvarchar in SQL Server?" I have a question.
I'm using MS SQL Server 2008 R2
DECLARE @T TABLE
(
    C1 VARCHAR(20) COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS,
    C2 NVARCHAR(20) COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS
)
INSERT INTO @T VALUES (N'中华人民共和国', N'中华人民共和国')
SELECT LEN(C1) AS [LEN(C1)],
       DATALENGTH(C1) AS [DATALENGTH(C1)],
       LEN(C2) AS [LEN(C2)],
       DATALENGTH(C2) AS [DATALENGTH(C2)]
FROM @T
Returns
LEN(C1)     DATALENGTH(C1) LEN(C2)     DATALENGTH(C2)
----------- -------------- ----------- --------------
7           12             7           14
Why is DATALENGTH(C1) 12?

In your INSERT, the text for C1 is converted from Unicode to the Chinese code page. This conversion can alter the text, and some characters may be lost.
Here is SQL Fiddle.
You can see that the second character 华 is stored as 3F in varchar, and so is the last character 国. 3F is the code for ?. When Windows converts text from Unicode to a code page and a character can't be represented in that code page, the conversion function (most likely WideCharToMultiByte) substitutes ? for it.
One more example: the second-to-last character 和 is encoded as A94D in varchar and 8C54 in nvarchar (the little-endian bytes of U+548C). If you look it up in Character Map, it shows both codes (Unicode and code page).
See also:
What does it mean when my text is displayed as Question Marks?
https://www.microsoft.com/middleeast/msdn/Questionmark.aspx
Any time Unicode data must be displayed, they may be internally
converted from Unicode using the WideCharToMultiByte API. Any time a
character cannot be represented on the current code page, it will be
replaced by a question mark (?).
This is exactly what is happening when you store a unicode literal N'中华人民共和国' in a varchar column. The unicode text is converted to multi-byte and some characters can't be represented in that code page and they are replaced by question marks ?.
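The byte counts from the question can be reproduced outside SQL Server. The Python sketch below is only an illustration. It assumes that Python's cp950 codec is a reasonable stand-in for the code page behind the Chinese_Traditional collation, and that errors="replace" mimics the ? substitution described above; neither assumption comes from the original posts.

```python
text = "中华人民共和国"

# Assumption: cp950 approximates the VARCHAR code page of the collation;
# errors="replace" substitutes b"?" for characters the code page lacks.
varchar_bytes = text.encode("cp950", errors="replace")

# NVARCHAR stores UTF-16 (little-endian): two bytes per character here.
nvarchar_bytes = text.encode("utf-16-le")

print(len(text), len(varchar_bytes), len(nvarchar_bytes))
print(varchar_bytes.hex().upper())
```

Five characters take two bytes each and the two unrepresentable ones collapse to a single ? byte, which accounts for the 12.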

Related

Comparing the same character in VARCHAR and NVARCHAR differs between CP1/CP1252 vs. CP850 based on DB collation

Here are my two variables:
DECLARE @First VARCHAR(254) = '5’-Phosphate Analogs Freedom to Operate'
DECLARE @Second NVARCHAR(254) = CONVERT(NVARCHAR(254), @First)
I have two databases, let's call them "Database1" and "Database2". Database1 has a default collation of SQL_Latin1_General_CP850_CI_AS; Database2 is SQL_Latin1_General_CP1_CI_AS. Both databases have a compatibility level of SQL Server 2008 (100).
I first connect to Database1 and run the following queries:
SELECT CASE
    WHEN @First COLLATE SQL_Latin1_General_CP1_CI_AS
       = @Second COLLATE SQL_Latin1_General_CP1_CI_AS
    THEN 'Equal' ELSE 'Not Equal' END
SELECT CASE
    WHEN @First COLLATE SQL_Latin1_General_CP850_CI_AS
       = @Second COLLATE SQL_Latin1_General_CP850_CI_AS
    THEN 'Equal' ELSE 'Not Equal' END
The results are:
Equal
Equal
Then I connect to Database2 and run the queries; the results are:
Equal
Not Equal
Note that I have not changed the queries themselves, just the db connection, and I'm specifying the collations to be used rather than allowing them to use the databases' default collations. Therefore, it's my understanding that the database default collation should not matter, i.e. the results of the queries should be the same regardless of which database I'm connected to.
I have three questions:
Why do I get different results when the only thing I change is the database to which I'm connected, given that I've effectively ignored the default database collation by explicitly specifying my own?
For the test against Database2, why does the comparison succeed with the SQL_Latin1_General_CP1_CI_AS collation and fail with the SQL_Latin1_General_CP850_CI_AS collation? What is the difference between the two collations that accounts for this?
Most Perplexing: If the default collation of the database to which I'm connected does matter, as it would seem, and the default collation of Database1 is SQL_Latin1_General_CP850_CI_AS (which, remember from my first test resulted in Equal, Equal) why does the second query, which explicitly specifies the very same collation fail (Not Equal) when connected to Database2?
Simply because this is how non-Unicode data works. Non-Unicode data (i.e. 8-bit Extended ASCII) uses the same characters for the first 128 values, but different characters for the second set of 128 characters, based on the Code Page. The character you are testing — ’ — exists in Code Page 1252 but not in Code Page 850.
Yes, the default Collation of the "current" database absolutely matters for string literals and local variables. When you are in a database with a default Collation that uses Code Page 850, the ’ in that non-Unicode string literal (i.e. a string that is not prefixed with N) is automatically converted to an equivalent that does exist in Code Page 850. BUT, that character does indeed exist in Code Page 1252, so there is no need for it to be converted there.
So why is it "not equal" when in a database using a Collation associated with Code Page 1252 between the non-Unicode string and the Unicode string? Because when converting the non-Unicode string into Unicode, another conversion takes place that translates the character into its true Unicode value, which is above decimal value 256.
Run the following in both databases and you will see what happens:
SELECT ASCII('’') AS [AsciiValue], UNICODE('’') AS [CodePoint];
SELECT ASCII('’' COLLATE SQL_Latin1_General_CP1_CI_AS) AS [AsciiValue],
UNICODE('’' COLLATE SQL_Latin1_General_CP1_CI_AS) AS [CodePoint];
SELECT ASCII('’' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [AsciiValue],
UNICODE('’' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [CodePoint];
Results when the "current" database uses a Collation associated with Code Page 850 (all 3 queries return the same thing):
AsciiValue CodePoint
39 39
As you can see from the above, a COLLATE clause on a string literal takes effect only after that string has already been interpreted according to the default Collation of the "current" database.
Results when the "current" database uses a Collation associated with Code Page 1252:
-- no COLLATE clause
AsciiValue CodePoint
146 8217
-- COLLATE SQL_Latin1_General_CP1_CI_AS
AsciiValue CodePoint
146 8217
-- COLLATE SQL_Latin1_General_CP850_CI_AS
AsciiValue CodePoint
39 39
But why the conversion from 146 to 8217 if the character is available in Code Page 1252? Because the first 256 characters in Unicode are not Code Page 1252, but instead are ISO-8859-1. These two Code Pages are mostly the same, but differ by several characters in the 128 - 255 range. In the ISO-8859-1 Code Page, those values are control characters. Microsoft felt it better to not waste 16 (or however many) characters on non-printable control characters when the limit was already 256 characters. So they swapped out the control characters for more usable ones, and hence Code Page 1252. But the Unicode group used the standardized ISO-8859-1 for the first 256 characters.
Why does this matter? Because the character you are testing with is one of the lucky few that is in Code Page 1252 but not in ISO-8859-1, hence it cannot remain as 146 when converted to NVARCHAR, and gets translated to its Unicode value, which is 8217. You can see this behavior by running the following:
SELECT '~' + CHAR(146) + '~', N'~' + NCHAR(146) + N'~';
-- ~’~ ~~
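The same difference can be seen outside SQL Server. The short Python sketch below is an illustration only: it decodes the single byte 146 once as Code Page 1252 and once as ISO-8859-1 (Python's "latin-1"), which matches the first 256 Unicode code points.

```python
byte_146 = bytes([146])

as_cp1252 = byte_146.decode("cp1252")    # what value 146 means in Code Page 1252
as_latin1 = byte_146.decode("latin-1")   # what value 146 means in ISO-8859-1 / Unicode

print(hex(ord(as_cp1252)))                            # code point of ’
print(hex(ord(as_latin1)), as_latin1.isprintable())   # a C1 control character
```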
Everything shown above explains most of the observed behavior, but does not explain why @First and @Second, when specified with COLLATE SQL_Latin1_General_CP850_CI_AS but running in a database having a default Collation associated with Code Page 1252, register as "Not Equal". If using Code Page 850 translates them to ASCII 39, they should still be equal, right?
This is due to both the sequence of events and the fact that Code Pages are not relevant to Unicode data (i.e. anything stored in NCHAR, NVARCHAR, and the deprecated NTEXT type that nobody should be using). Breaking down what is happening:
Start with @First being declared and initialized (i.e. DECLARE @First VARCHAR(1) = '’';). It is a VARCHAR type, hence using a Code Page, and hence using the Code Page associated with the default Collation of the "current" database.
The default Collation of the "current" database is associated with Code Page 1252, hence this value is not translated to ASCII 39, but exists happily as ASCII 146.
Next @Second is declared and initialized (i.e. DECLARE @Second NVARCHAR(1) = @First; -- no need for explicit CONVERT as this is not production code and it will be converted implicitly). This is an NVARCHAR type which, as we have seen, has the character, but converts the value from ASCII 146 to Code Point U+2019 (Decimal 8217 = 0x2019).
In the comparison, using @First COLLATE SQL_Latin1_General_CP850_CI_AS starts with ASCII 146 as @First is VARCHAR data using the Code Page specified by the default Collation of the "current" database. But then, since that character does not exist in Code Page 850 (as specified by the Collation used in the COLLATE clause) it gets translated into ASCII 39 (as we have seen above).
Why didn't @Second COLLATE SQL_Latin1_General_CP850_CI_AS also translate that character to ASCII 39 so that they would register as "Equal"? Because:
@Second is NVARCHAR and does not use Code Pages as all characters are represented in a single character set (i.e. Unicode). So changing the Collation can only change the rules governing how to compare and sort the characters, but will not alter the characters such as what happens sometimes when changing the Collation of VARCHAR data (like in this case of ’). Hence this side of the comparison is still Code Point U+2019.
@First, being VARCHAR, will get implicitly converted into NVARCHAR for the comparison. BUT, the ’ character had already been translated into ASCII 39 by the COLLATE SQL_Latin1_General_CP850_CI_AS clause, and ASCII 39 is found in Unicode in that same position, either as Decimal 39 or Code Point U+0027 (from SELECT CONVERT(BINARY(2), 39)).
Resulting comparison is between: Code Point U+2019 and Code Point U+0027
Ergo: Not Equal
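The sequence of events above can be traced step by step in a rough Python simulation. This is a sketch under stated assumptions, not SQL Server's actual implementation: BEST_FIT is a hypothetical one-entry table standing in for the ’ to ' translation the answer observed, and to_code_page is a helper invented for this illustration.

```python
# Hypothetical stand-in for the best-fit translation observed above (’ becomes ').
BEST_FIT = {"\u2019": "'"}

def to_code_page(text, code_page):
    """Encode text, falling back to BEST_FIT (or '?') for unsupported characters."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(code_page)
        except UnicodeEncodeError:
            out += BEST_FIT.get(ch, "?").encode(code_page)
    return bytes(out)

# Steps 1-2: VARCHAR variable in a database whose default Collation uses Code Page 1252.
first = to_code_page("\u2019", "cp1252")

# Step 3: the NVARCHAR copy holds the true Unicode code point.
second = first.decode("cp1252")

# Step 4: COLLATE ..._CP850_... re-expresses the VARCHAR value in Code Page 850.
first_cp850 = to_code_page(first.decode("cp1252"), "cp850")

# Final comparison: the VARCHAR side is converted to Unicode, the NVARCHAR side is untouched.
result = "Equal" if first_cp850.decode("cp850") == second else "Not Equal"
print(first, hex(ord(second)), first_cp850, result)
```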
For more info on working with Collations, Encodings, Unicode, etc, please visit: Collations Info

Why can I store a Ukrainian string in a varchar column?

I was a little surprised that I was able to store a Ukrainian string in a varchar column.
My table is:
create table delete_collation
(
text1 varchar(100) collate SQL_Ukrainian_CP1251_CI_AS
)
and using this query I am able to insert:
insert into delete_collation
values(N'використовується для вирішення квитки')
but when I remove the 'N' prefix, the select statement shows ??????.
Is it okay or am I missing something in understanding unicode and non-unicode with collate?
From MSDN:
Prefix Unicode character string constants with the letter N. Without
the N prefix, the string is converted to the default code page of the
database. This default code page may not recognize certain characters.
UPDATE:
Please see these similar questions:
What is the meaning of the prefix N in T-SQL statements?
Cyrillic symbols in SQL code are not correctly after insert
sql server 2012 express do not understand Russian letters
To expand on MegaTron's answer:
Using collate SQL_Ukrainian_CP1251_CI_AS, SQL Server is able to store Ukrainian characters in a varchar column by using code page 1251.
However, when you specify a string without the N prefix, that string is first converted to the default non-Unicode code page (the database's default code page, per the MSDN quote above), and that is why you see ??????.
So it is completely fine to use varchar and collate as you do, but you must always include the N prefix on string literals to avoid the intermediate conversion to the default (non-Ukrainian) code page.
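As an illustration outside SQL Server, the Python sketch below contrasts the two paths. It assumes cp1252 as the "default" non-Ukrainian code page (the original post does not say which one was in effect) and uses errors="replace" to mimic the ? substitution.

```python
text = "використовується для вирішення квитки"

# Code page 1251 (the one behind SQL_Ukrainian_CP1251_CI_AS) can hold every character.
stored = text.encode("cp1251")
round_trip = stored.decode("cp1251")

# Assumption: cp1252 stands in for a non-Ukrainian default code page.
lossy = text.encode("cp1252", errors="replace").decode("cp1252")

print(round_trip == text)
print(lossy)
```

Once the characters have been replaced by ?, converting to code page 1251 afterwards cannot bring them back.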

Why can I insert non-ASCII characters into a VARCHAR column and correctly get them back?

Below is my code sample.
DECLARE @a TABLE (a VARCHAR(20));
INSERT @a
(a)
VALUES ('中');
SELECT *
FROM @a;
I'm using SQL Server Management Studio to run it. My question is: why can I insert non-ASCII characters into a VARCHAR column and correctly get them back? As I understand it, VARCHAR is only for ASCII characters and NVARCHAR is for Unicode characters. Can anyone explain this? I'm on Windows 7 with SQL Server 2014 Developer Edition.
The codepage used to store the varchar data varies by DB collation.
https://msdn.microsoft.com/en-us/library/ms189617.aspx
Varchar is 8 bits, so you may have a different collation, or you may have gotten lucky on where your character falls on the code set
You can find the ASCII and Extended ASCII characters below.
ASCII
Extended ASCII
I don't believe '中' is an ASCII character.
www.asciitable.com
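Whether a character such as 中 survives in VARCHAR depends on the code page in use. The Python sketch below is a rough check against a hypothetical list of candidate code pages. The asker's actual collation is not stated, so the list is an assumption chosen for illustration.

```python
# Hypothetical candidates: Western European, Simplified Chinese, Traditional Chinese, Japanese.
candidates = ["cp1252", "cp936", "cp950", "cp932"]

def fits(ch, code_page):
    """Return True if the code page can represent the character."""
    try:
        ch.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

representable = [cp for cp in candidates if fits("中", cp)]
print(representable)
print(len("中".encode("cp936")))  # bytes needed in a double-byte code page
```

If the database collation maps to one of the double-byte East Asian code pages, the character is stored in two bytes and reads back intact, with no luck involved.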

Why does casting a UTF-8 VARCHAR column to XML require converting to NVARCHAR and encoding change?

I am trying to convert data in a varchar column to XML but I was getting errors with certain characters. Running this ...
-- This fails
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@Data AS XML) AS DataXml
... results in the following error
Msg 9420, Level 16, State 1, Line 3
XML parsing: line 1, character 55, illegal xml character
It appears that it's the broken pipe character that is causing the error but I thought that it was a valid character for UTF-8. Looking at the XML spec it appears to be valid.
When I change it to this ...
-- This works
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(CAST(@Data AS NVARCHAR(MAX)), 'encoding="utf-8"', '') AS XML) AS DataXml
... it works without error (replacing the encoding string with utf-16 also works). I'm using SQL Server 2008 R2 with the SQL_Latin1_General_CP1_CI_AS collation.
Can anyone tell me why I need to convert to NVARCHAR and strip the encoding="utf-8" for this to work?
Thanks,
Edit
It appears that this also works ...
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(@Data, 'encoding="utf-8"', '') AS XML) AS DataXml
Removing the utf-8 encoding from the prolog is sufficient for SQL Server to do the conversion.
Remy's answer is, unfortunately, incorrect. VARCHAR absolutely does support Extended ASCII. Standard ASCII is only the first 128 values (0x00 - 0x7F). That happens to be the same for all code pages (i.e. 8-bit VARCHAR data) and UTF-16 (i.e. 16-bit NVARCHAR data) in SQL Server. Extended ASCII covers the remaining 128 of the 256 total values (0x80 - 0xFF). These 128 values / code points differ per code page, though there is a lot of overlap between some of them.
Remy states that VARCHAR does not support U+00A6 BROKEN BAR. This is easily disproven by simply adding SELECT @Data; after the first line:
DECLARE @Data VARCHAR(1000) =
    '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT @Data;
That returns:
<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>
The ¦ character is clearly supported, so the problem must be something else.
It appears that it's the broken pipe character that is causing the error but I thought that it was a valid character for UTF-8.
The broken pipe character is a valid character in UTF-8. The problem is: you aren't passing in UTF-8 data. Yes, you state that the encoding is UTF-8 in the xml declaration, but that doesn't mean that the data is UTF-8, it merely sets the expectation that it needs to be UTF-8.
You are converting a VARCHAR literal into XML. Your database's default collation is SQL_Latin1_General_CP1_CI_AS which uses the Windows-1252 code page for VARCHAR data. This means that the broken vertical bar character has a value of 166 or 0xA6. Well, 0xA6 is not a valid UTF-8 encoded anything. If you were truly passing in UTF-8 encoded data, then that broken vertical bar character would be two bytes: 0xC2 and then 0xA6. If we add that 0xC2 byte to the original input value (the 0xA6 is the same, so we can keep that where it is), we get:
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test'
    + CHAR(0xC2) + '¦</NewDataSet>';
SELECT @Data AS [@Data];
SELECT CAST(@Data AS XML) AS [DataXml];
and that returns:
<?xml version="1.0" encoding="utf-8"?><NewDataSet>TestÂ¦</NewDataSet>
followed by:
<NewDataSet>Test¦</NewDataSet>
This is why removing the encoding="utf-8" fixed the problem:
with it there, the bytes of that string needed to actually be UTF-8 but they weren't, and ...
with it removed, the encoding is assumed to be the same as the string itself, which is VARCHAR, and that means the encoding is the code page associated with the collation of the string, and a VARCHAR literal or variable uses the database's default collation. Meaning, in this context, either without the encoding="xxxxxx", or with encoding="Windows-1252", the bytes will need to be encoded as Windows-1252, and indeed they are.
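The byte-level argument can be checked outside SQL Server. The Python sketch below is only an illustration of the encodings involved, not of the XML parser itself. It shows that a lone 0xA6 is not valid UTF-8, and that prepending 0xC2 (as the CHAR(0xC2) example does) produces the real UTF-8 sequence for ¦.

```python
broken_bar = "\u00a6"  # ¦

cp1252_bytes = broken_bar.encode("cp1252")  # what VARCHAR holds under Code Page 1252
utf8_bytes = broken_bar.encode("utf-8")     # what genuine UTF-8 data would contain

try:
    cp1252_bytes.decode("utf-8")
    lone_a6_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_a6_is_valid_utf8 = False

# Prepending 0xC2 yields valid UTF-8; read as Code Page 1252 it shows up as two characters.
patched = b"\xc2" + cp1252_bytes
print(cp1252_bytes.hex(), utf8_bytes.hex(), lone_a6_is_valid_utf8)
print(patched.decode("utf-8"), patched.decode("cp1252"))
```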
Putting this all together, we get:
If you have an actual UTF-8 encoded string, then it can be passed into the XML datatype, but you need to have:
no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
the XML declaration stating that the encoding is UTF-8
If you have a string encoded in the code page that is associated with the database's default collation, then you need to have:
no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
either no "encoding" as part of an <?xml ?> declaration, or have encoding set to the code page associated with the database's default collation (e.g. Windows-1252 for code page 1252)
If your string is already Unicode, then you need to:
prefix a string literal with an upper-case "N" or use an NVARCHAR variable or column for the incoming XML
have either no "encoding" as part of an <?xml ?> declaration, or have encoding set to "utf-16"
Please see my answer to "Converting accented characters in varchar() to XML causing “illegal XML character”" for more details on this.
And, just to have it stated: while SQL Server 2019 introduced native support for UTF-8 in VARCHAR literals, variables, and columns, that has no impact on what is being discussed in this answer.
For info on collations, character encoding, etc, please visit: Collations Info
Your pipe character is using Unicode codepoint U+00A6 BROKEN BAR instead of U+007C VERTICAL LINE. U+00A6 is outside of ASCII. VARCHAR does not support non-ASCII characters. That is why you have to use NVARCHAR instead, which is designed to handle Unicode data.

SQL Server character set and N prefix

[THIS IS NOT A QUESTION ABOUT NVARCHAR OR HOW TO STORE CHINESE CHARACTER]
SQL Server 2008 Express
Database collation is SQL_Latin1_General_CP1_CI_AS
create table sample1(val varchar(2))
insert into sample1 values(N'中文')
I know these Chinese characters would become junk characters.
I know I can use nvarchar to overcome all problem.
What I don't know is: why isn't there a "string too long" error when I run the insert statement?
N prefix means that client will encode the string using UNICODE.
2 Chinese characters will become 4 bytes.
varchar(2) can only contain 2 bytes.
Why people down vote this question? really?
An implied cast takes place. This would work if "val" was created as nvarchar(2).
More explanation to add to @marc_s's answer.
The string N'中文' will be converted to varchar with the collation SQL_Latin1_General_CP1_CI_AS. Since these characters do not exist in that code page, each one is treated as undefined and ends up as 0x3f, giving 0x3f3f. 0x3f is the question mark, so there will be two question marks in this case and the value won't exceed the column length.
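The arithmetic can be illustrated with a small Python sketch. It assumes cp1252 for SQL_Latin1_General_CP1_CI_AS and uses errors="replace" to mimic the ? substitution. The point is that the length check applies to the converted value, not to the 4-byte Unicode literal.

```python
literal = "中文"

as_unicode = literal.encode("utf-16-le")                 # size of N'中文' as Unicode data
as_varchar = literal.encode("cp1252", errors="replace")  # assumption: mimics the ? substitution

column_size = 2  # varchar(2)
print(len(as_unicode), as_varchar, len(as_varchar) <= column_size)
```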
Try to use NVARCHAR(...), NCHAR(...) datatypes -
CREATE TABLE dbo.sample1
(
val NVARCHAR(4)
)
INSERT INTO dbo.sample1
SELECT N'中文'
