SQL Server Unicode queries with SC Collation - sql-server

In SQL Server 2012 I have a table with an nvarchar column with collation Latin1_General_100_CI_AS_SC, which is supposed to support Unicode surrogate pair characters, or supplementary characters.
When I run this query:
select KeyValue from terms where KeyValue = N'➰'
(the character above is CURLY LOOP, code point 10160, U+27B0)
The result is hundreds of different-looking single-character entries, even though they all have different UTF-16 code points. Is this due to collation? Why isn't there an exact match?
EDIT: I now think this is due to collation. There seems to be a group of "undefined" characters in the UTF-16 range, more than 1733 characters, and they are treated as the same by this collation. Although, characters with codes above 65535 are treated as unique and those queries return exact matches.
The two queries below have different results:
select KeyValue from terms where KeyValue = N'π'
returns 3 rows: π and ℼ and ᴨ
select KeyValue from terms where KeyValue LIKE N'π'
returns 2 rows: π and ℼ
Why is this?
This is the weirdest of all. This query:
select KeyValue from terms where KeyValue like N'➰%'
returns ALMOST ALL records in the table, which contains many multi-character terms in the plain Latin character set, like "8w" or "apple". About 90% of the rows not returned start with "æ". What is happening?
NOTE: Just to give this a bit of context, these are all Wikipedia article titles, not random strings.

SQL Server (and thus tempdb) also has its own collation, and it may not be the same as a database's or a column's collation. While character literals should be assigned the default collation of the column or database, the above (perhaps overly simplified) T-SQL examples could be misstating or not revealing the true problem. For example, an ORDER BY clause could have been omitted for the sake of simplicity. Are the expected results returned when the above statements explicitly use a COLLATE clause (COLLATE Latin1_General_100_CI_AS_SC)? See https://msdn.microsoft.com/en-us/library/ms184391.aspx

I have a table with an nvarchar column with collation Latin1_General_100_CI_AS_SC, which is supposed to support Unicode surrogate pair characters, or supplementary characters.
The Supplementary Character-Aware (SCA) collations (those ending with _SC or with _140_ in their names) do support supplementary characters. BUT, "support" only means that the built-in functions handle a surrogate pair as a single supplementary code point instead of as a pair of surrogate code points. Support for sorting and comparison of supplementary characters actually started in SQL Server 2005 with the introduction of the version 90 collations.
even though they all have different UTF-16 codepoints. Is this due to collation? Why isn't there an exact match?
UTF-16 doesn't have code points; it is an encoding that can encode all Unicode code points.
Yes, this behavior is due to collation.
There is no exact match because (as you guessed), code point U+27B0 has no defined sort weight. Hence it is completely ignored and equates to an empty string or any other code point that has no sort weight.
There seems to be a group of "undefined" characters in the UTF-16 range, more than 1733 characters, and they are treated as the same by this collation.
Correct, though some only have a sort weight due to the accent sensitivity of the collation. You would get even more matches if you used Latin1_General_100_CI_AI_SC. And, to be clear, the UTF-16 "range" is all 1,114,112 Unicode code points.
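To make the code-point arithmetic concrete, here is a small Python sketch (illustrative only; SQL Server's collation internals are its own). It shows the size of the Unicode code space, and that U+27B0 is a BMP character while code points above U+FFFF become surrogate pairs in UTF-16:

```python
# The Unicode code space runs from U+0000 to U+10FFFF: 1,114,112 code points.
total_code_points = 0x10FFFF + 1
print(total_code_points)  # 1114112

# U+27B0 (CURLY LOOP) is in the Basic Multilingual Plane, so UTF-16
# encodes it as a single 2-byte code unit -- it is NOT a surrogate pair.
curly_loop = "\u27b0"
print(len(curly_loop.encode("utf-16-le")))  # 2 bytes -> one code unit

# A code point above U+FFFF, e.g. U+1F600, needs two 2-byte code units
# (a surrogate pair) in UTF-16 -- this is what SCA collations handle.
emoji = "\U0001F600"
print(len(emoji.encode("utf-16-le")))  # 4 bytes -> surrogate pair
```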
The two queries below have different results ... Why is this?
I can't (yet!) explain why = vs LIKE returns different sets of matches, but there is 1 more character that equates to the 3 that you currently have:
SELECT KeyValue, CONVERT(VARBINARY(40), [KeyValue])
FROM (VALUES (N'π' COLLATE Latin1_General_100_CI_AS_SC), (N'ℼ'), (N'ᴨ'),
(N'Π')) t([KeyValue])
WHERE KeyValue = N'π'; -- 4 rows
SELECT KeyValue, CONVERT(VARBINARY(40), [KeyValue])
FROM (VALUES (N'π' COLLATE Latin1_General_100_CI_AS_SC), (N'ℼ'), (N'ᴨ'),
(N'Π')) t([KeyValue])
WHERE KeyValue LIKE N'π'; -- 3 rows
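As a cross-check outside SQL Server, Python's unicodedata module shows the Unicode-level relationships among these pi variants. This is only an analogy: the sort weights are SQL Server's own tables, not Unicode normalization, but it illustrates which characters are compatibility or case variants of π and which (like ᴨ) are equated purely by the collation:

```python
import unicodedata

# U+213C DOUBLE-STRUCK SMALL PI is a compatibility variant of U+03C0.
assert unicodedata.normalize("NFKC", "\u213c") == "\u03c0"

# U+03A0 GREEK CAPITAL LETTER PI case-folds to U+03C0 -- which is why a
# case-insensitive (CI) collation matches it.
assert "\u03a0".casefold() == "\u03c0"

# U+1D28 GREEK LETTER SMALL CAPITAL PI is NOT related by normalization
# or case folding -- only the collation's own weight table equates it.
assert unicodedata.normalize("NFKC", "\u1d28") == "\u1d28"
assert "\u1d28".casefold() == "\u1d28"
print("ok")
```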
This is the weirdest of all. This query: ... returns ALMOST ALL records in the table
SELECT 1 WHERE NCHAR(0x27B0) = NCHAR(0x0000) COLLATE Latin1_General_100_CI_AS_SC;
-- 1
SELECT 2 WHERE NCHAR(0x27B0) = N'' COLLATE Latin1_General_100_CI_AS_SC;
-- 2
SELECT 3 WHERE NCHAR(0x27B0) = NCHAR(0x27B0) + NCHAR(0x27B0) + NCHAR(0x27B0)
COLLATE Latin1_General_100_CI_AS_SC;
-- 3
SELECT 4 WHERE N'➰' = N'➰ ➰ ➰ ➰' COLLATE Latin1_General_100_CI_AS_SC;
-- 4
SELECT 5 WHERE N'➰' LIKE N'➰ ➰ ➰ ➰' COLLATE Latin1_General_100_CI_AS_SC;
-- NO ROWS RETURNED!! (spaces matter here due to LIKE)
SELECT 6 WHERE N'➰' LIKE N'➰➰➰➰➰➰' COLLATE Latin1_General_100_CI_AS_SC;
-- 6
This, again, has something to do with the fact that "➰" has no sort weight defined. Of course, neither do æ, Þ, ß, LJ, etc.
I will update this answer once I have figured out exactly what LIKE is doing differently than =.
For more info, please see:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
Collations Info

Related

SQL Server string comparison with equals sign and equals or greater in the strings [duplicate]

I have seen prefix N in some insert T-SQL queries. Many people have used N before inserting the value in a table.
I searched, but I was not able to understand what is the purpose of including the N before inserting any strings into the table.
INSERT INTO Personnel.Employees
VALUES(N'29730', N'Philippe', N'Horsford', 20.05, 1),
What purpose does this 'N' prefix serve, and when should it be used?
It's declaring the string as the nvarchar data type, rather than varchar.
You may have seen Transact-SQL code that passes strings around using an N prefix. This denotes that the subsequent string is in Unicode (the N actually stands for National language character set), which means that you are passing an NCHAR, NVARCHAR, or NTEXT value, as opposed to CHAR, VARCHAR, or TEXT.
To quote from Microsoft:
Prefix Unicode character string constants with the letter N. Without
the N prefix, the string is converted to the default code page of the
database. This default code page may not recognize certain characters.
If you want to know the difference between these two data types, see this SO post:
What is the difference between varchar and nvarchar?
Let me tell you an annoying thing that happened with the N' prefix - I wasn't able to fix it for two days.
My database collation is SQL_Latin1_General_CP1_CI_AS.
It has a table with a column called MyCol1, which is an nvarchar.
This query fails to match an exact value that exists:
SELECT TOP 1 * FROM myTable1 WHERE MyCol1 = 'ESKİ'
-- 0 results
Using the N'' prefix fixes it:
SELECT TOP 1 * FROM myTable1 WHERE MyCol1 = N'ESKİ'
-- 1 result - found!
Why? Because the code page behind the Latin1_General collation doesn't contain the dotted capital İ, which is why the match fails, I suppose.
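The code-page explanation can be checked with a quick Python sketch (illustrative, not T-SQL; note that Python's strict codecs replace unmappable characters, whereas Windows conversions may "best fit" İ to a plain I):

```python
# SQL_Latin1_General_CP1_* collations store VARCHAR data as code page
# 1252, which has no dotted capital I (U+0130).
ch = "\u0130"  # 'İ'

print(ch.encode("cp1252", errors="replace"))  # b'?'  -- lost in CP1252
print(ch.encode("cp1254"))     # b'\xdd' -- the Turkish code page has it
print(ch.encode("utf-16-le"))  # b'0\x01' -- NVARCHAR (UTF-16) always has it
```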
1. Performance:
Assume your where clause is like this:
WHERE NAME='JON'
If the NAME column is of any type other than nvarchar or nchar, then you should not specify the N prefix. However, if the NAME column is of type nvarchar or nchar and you do not specify the N prefix, then 'JON' is treated as non-Unicode. This means the data types of the NAME column and the string 'JON' are different, so SQL Server implicitly converts one operand's type to the other. If SQL Server converts the literal's type to the column's type, then there is no issue; but if it converts the column's type to the literal's type, then performance suffers because the column's index (if available) won't be used.
2. Character set:
If the column is of type nvarchar or nchar, then always use the prefix N when specifying the character string in a WHERE criterion or an UPDATE/INSERT clause. If you do not, and one of the characters in your string is outside the code page (an international character such as ā, for example), then the query will fail or the data will be corrupted.
In short: we use the N'' prefix only when the target value is of an nvarchar or nchar type.

unable to update nvarchar(50) having czech letters in it [duplicate]


WHERE equals condition returns mapped Unicode (fullwidth) results

We are querying a SQL Server database for names that are stored in an nvarchar column. In this table, we have two values that conflict with each other: Ｗｏｒｄ and Word. The first one is made out of fullwidth Latin letters.
When we try to select the ASCII name, the Unicode version also returns. This causes conflicts as the query should only be able to return one row. Below is a query which can be used to reproduce the results:
SELECT CASE WHEN N'Ｗｏｒｄ' = N'Word' THEN 1 ELSE 0 END;
This query returns 1, while we expect it to return 0. It seems that SQL Server maps the Unicode (fullwidth) version of each letter to its ASCII variant.
Is there a way to disable this mapping between the ASCII and Unicode characters? While still being able to ignore the capitalization.
When we try to select the ASCII name, the Unicode version also returns.
This statement is a bit of a misunderstanding of how encodings work. ASCII is a 7-bit character set and encoding, covering values 0 - 127, which are common across most code pages and Unicode. However, it really only applies to VARCHAR data. When using NVARCHAR, all characters are Unicode, even if those characters are also found in other character sets. So here, you are only getting Unicode characters returned, since NVARCHAR only holds Unicode characters (encoded as UTF-16 Little Endian). It just so happens that the ASCII character set was duplicated as a subset of Unicode.
Meaning, what you are really saying here is that you want the regular Latin characters only, not the fullwidth version.
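A small Python sketch of the subset relationship described above (illustrative, not T-SQL):

```python
# ASCII covers only code points 0-127, and those same values reappear as
# the first 128 Unicode code points; NVARCHAR stores them in UTF-16
# Little Endian as the ASCII byte plus a zero byte.
s = "Word"

assert all(ord(c) < 128 for c in s)            # every character is ASCII
assert s.encode("ascii") == s.encode("utf-8")  # identical bytes within the subset
assert s.encode("utf-16-le") == b"W\x00o\x00r\x00d\x00"
print("ok")
```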
It seems that SQL Server maps Unicode based versions of each letter to their ASCII variant.
Yes and no. Windows and SQL Server can map Unicode characters to similar looking characters within an 8-bit code page, but that only happens when converting a Unicode string to an 8-bit code page (or from one code page to another). That is not happening here. Here, again, you are only dealing with Unicode. It just so happens that both regular and fullwidth forms of the US English alphabet are considered equal when the Collation is Width Insensitive. And based on your question and the test case (two separate things since a column's Collation is used when querying a column, but the DB's default Collation is used when dealing only with string literals and/or variables), it is clear that the Collations you are using (which could both be the same Collation) are Width Insensitive.
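The width relationship can be seen outside SQL Server too. In Unicode terms the fullwidth letters are compatibility variants of the regular ones, which is roughly what a width-insensitive collation equates; a Python sketch (illustrative only):

```python
import unicodedata

# The fullwidth letters are distinct code points (U+FF21..U+FF5A area)
# that a width-insensitive collation treats as equal to the regular ones.
full = "\uff37\uff4f\uff52\uff44"  # fullwidth 'Ｗｏｒｄ'
half = "Word"

assert full != half                                 # different code points...
assert unicodedata.normalize("NFKC", full) == half  # ...same compatibility form
print("ok")
```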
To fix this, please do not use a binary Collation. Using a binary Collation is the unfortunately commonly-accepted go-to answer to fix queries when people get more matches than they were expecting. And sometimes it is the correct answer, but more often than not, such as with this question, it isn't.
You simply need to add "width sensitivity" to the Collation that you are using. You can find the column's Collation with the following query, just fill in the correct table name and column name:
SELECT col.[collation_name]
FROM sys.columns col
WHERE col.[object_id] = OBJECT_ID(N'<schema_name>.<table_name>')
AND col.[name] = N'<column_name>';
If the Collation is a Windows Collation (i.e. the name does not start with SQL_) then you might just be able to add _WS to the end of the Collation name. For example:
Latin1_General_100_CS_AS --> Latin1_General_100_CS_AS_WS
If the Collation is a SQL Server Collation (i.e. the name does start with SQL_), then none of those allow for width sensitivity, and you should choose an equivalent Windows Collation. If the Collation is SQL_Latin1_General_CP1_*, then try the equivalent name starting with Latin1_General_100_.
-- current Collation (no width sensitivity)
SELECT CASE WHEN N'Ｗｏｒｄ' = N'Word' COLLATE Latin1_General_100_CI_AS THEN 1
ELSE 0 END;
-- 1
-- add width sensitivity
SELECT CASE WHEN N'Ｗｏｒｄ' = N'Word' COLLATE Latin1_General_100_CI_AS_WS THEN 1
ELSE 0 END;
-- 0
-- confirm case INsensitivity
SELECT CASE WHEN N'WORD' = N'Word' COLLATE Latin1_General_100_CI_AS_WS THEN 1
ELSE 0 END;
-- 1
For more details on why you should first attempt to get the correct sensitivity before using a binary Collation, please see the following post of mine:
No, Binary Collations are not Case-Sensitive
You need to use a COLLATE clause.
Follow my examples and find out which collation is suitable for you.
This collation returns 1
SELECT CASE WHEN N'Ｗｏｒｄ' COLLATE Latin1_General_CI_AS = N'Word' COLLATE Latin1_General_CI_AS THEN 1 ELSE 0 END
This collation returns 0
SELECT CASE WHEN N'Ｗｏｒｄ' COLLATE SQL_Latin1_General_Cp437_BIN = N'Word' COLLATE SQL_Latin1_General_Cp437_BIN THEN 1 ELSE 0 END
The collation specifier tells SQL Server how to compare characters.
Find more detail here
List of collations
Because you may have more variety in your data, I can't tell what collation is best for you.

SQL Server: encoding of string constants in SQL

I have a problem with the encoding of string constants in queries against an NVARCHAR field in SQL Server v12.0.2. I need to use national characters (all in the same single code page, e.g. Cyrillic WIN1251) in queries without the N prefix.
Is it possible?
Example:
CREATE TABLE TEST (VALUE NVARCHAR(100) COLLATE Cyrillic_General_CI_AS);
INSERT INTO TEST VALUES (N'привет мир');
INSERT INTO TEST VALUES ('привет мир');
SELECT * FROM TEST;
This will return two rows:
| привет мир |
| ?????? ??? |
So the first insert works correctly, and I expected the second to do the same because the TEST.VALUE column is collated as Cyrillic_General_CI_AS. But it looks like national characters ignore the column collation and use a code page from somewhere else.
I realize that in this case I won't be able to use characters from more than one code page, or languages that don't fit a 1-byte encoding, but that is fine for me. The other option is to modify all queries to use the N prefix before string constants, but that is not possible.
Without the N prefix, the string is converted to the default code page of the database, not that of the table you're inserting into (see MSDN for details).
So either change the database collation to Cyrillic_General_CI_AS, or find all the string constants and add the N prefix.
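A Python sketch of that code-page conversion (illustrative; CP1252 stands in here for a Latin database default, which is an assumption about the asker's setup):

```python
# Under a Latin code page every Cyrillic letter becomes '?', which is
# exactly the second row in the question; under CP1251 the text survives.
s = "привет мир"

print(s.encode("cp1252", errors="replace"))      # b'?????? ???'
assert s.encode("cp1251").decode("cp1251") == s  # round-trips intact in CP1251
```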

Unable to return query Thai data

I have a table with columns that contain both Thai and English text data, stored as NVARCHAR(255).
In SSMS I can query the table and return all the rows easily enough. But if I then query specifically for one of the Thai results, it returns no rows.
SELECT TOP 1000 [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
Returns
Province District SubDistrict Branch
อุตรดิตถ์ ลับแล ศรีพนมมาศ Northern
Bangkok Khlong Toei Khlong Tan SSS1
But this query:
SELECT [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
where [Province] LIKE 'อุตรดิตถ์'
Returns no rows.
What do I need to do to get the expected results?
The collation is Latin1_General_CI_AS.
The data is displayed and inserted with no errors; I just can't search it.
Two problems:
The string being passed into the LIKE clause is VARCHAR due to not being prefixed with a capital "N". For example:
SELECT 'อุตรดิตถ์' AS [VARCHAR], N'อุตรดิตถ์' AS [NVARCHAR]
-- ????????? อุตรดิตถ์
What is happening here is that when SQL Server is parsing the query batch, it needs to determine the exact type and value of all literals / constants. So it figures out that 12 is an INT and 12.0 is a NUMERIC, etc. It knows that N'ดิ' is NVARCHAR, which is an all-inclusive character set, so it takes the value as is. BUT, as noted before, 'ดิ' is VARCHAR, which is an 8-bit encoding, which means that the character set is controlled by a Code Page. For string literals and variables / parameters, the Code Page used for VARCHAR data is the Database's default Collation. If there are characters in the string that are not available on the Code Page used by the Database's default Collation, they are either converted to a "best fit" mapping, if such a mapping exists, else they become the default replacement character: ?.
Technically speaking, since the Database's default Collation controls string literals (and variables), and since there is a Code Page for "Thai" (available in Windows Collations), then it would be possible to have a VARCHAR string containing Thai characters (meaning: 'ดิ', without the "N" prefix, would work). But that would require changing the Database's default Collation, and that is A LOT more work than simply prefixing the string literal with "N".
For an in-depth look at this behavior, please see my two-part series:
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part A of 2: “Duck”)
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part B of 2: “Rabbit”)
You need to add the wildcard characters to both ends:
N'%อุตรดิตถ์%'
The end result will look like:
WHERE [Province] LIKE N'%อุตรดิตถ์%'
EDIT:
I just edited the question to format the "results" to be more readable. It now appears that the following might also work (since no wildcards are being used in the LIKE predicate in the question):
WHERE [Province] = N'อุตรดิตถ์'
EDIT 2:
A string (i.e. something inside of single-quotes) is VARCHAR if there is no "N" prefixed to the string literal. It doesn't matter what the destination datatype is (e.g. an NVARCHAR(255) column). The issue here is the datatype of the source data, and that source is a string literal. And unlike a string in .NET, SQL Server handles 'string' as an 8-bit encoding (VARCHAR; ASCII values 0 - 127 same across all Code Pages, Extended ASCII values 128 - 255 determined by the Code Page, and potentially 2-byte sequences for Double-Byte Character Sets) and N'string' as UTF-16 Little Endian (NVARCHAR; Unicode character set, 2-byte sequences for BMP characters 0 - 65535, two 2-byte sequences for Code Points above 65535). Using 'string' is the same as passing in a VARCHAR variable. For example:
DECLARE @ASCII VARCHAR(20);
SET @ASCII = N'อุตรดิตถ์';
SELECT @ASCII AS [ImplicitlyConverted]
-- ?????????
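The same implicit conversion can be mimicked in Python (illustrative; CP1252 stands in for the database's Latin1 code page):

```python
# The Thai literal from the question as an 8-bit Latin code page (what
# VARCHAR gives you here) versus the Thai code page and UTF-16 (NVARCHAR).
thai = "อุตรดิตถ์"

print(thai.encode("cp1252", errors="replace"))  # b'?????????' -- all 9 chars lost
assert thai.encode("cp874").decode("cp874") == thai          # Thai code page keeps it
assert thai.encode("utf-16-le").decode("utf-16-le") == thai  # so does NVARCHAR
```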
Could be a number of things!
First off, print out the value of the column and your query string in hex:
SELECT convert(varbinary(20), Province) AS stored,
       convert(varbinary(20), N'อุตรดิตถ์') AS query
FROM allDistricsBranches;
This should give you some insight into the problem. I think the most likely cause is the ั and ิ characters being typed in the wrong sequence. They are displayed as part of the main letter but are stored internally as separate characters.
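The combining-character point can be demonstrated in Python (illustrative; the specific Thai sequence here is a made-up example, not the asker's data):

```python
# Thai vowel and tone marks are separate code points, so two visually
# similar strings can differ only in the order the marks were typed --
# and then they will not compare equal byte-for-byte.
a = "\u0e19\u0e34\u0e48"  # น + SARA I + MAI EK
b = "\u0e19\u0e48\u0e34"  # น + MAI EK + SARA I (marks swapped)

assert a != b        # different sequences, even if rendered alike
assert len(a) == 3   # the marks are stored as characters of their own
print("ok")
```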
