We are testing our application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.
On both Latin and Japanese-collated systems the following equality is true (U+3422):
N'㐢㐢㐢㐢' = N'㐢㐢㐢'
but the following is not (U+30C1):
N'チチチチ' = N'チチチ'
This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don't know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?
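For reference, a minimal repro of the two comparisons can be run without any table at all (the column aliases are only for illustration):
SELECT CASE WHEN N'㐢㐢㐢㐢' = N'㐢㐢㐢' THEN 'equal' ELSE 'not equal' END AS U3422_comparison, -- 'equal' on the collations we tested
       CASE WHEN N'チチチチ' = N'チチチ' THEN 'equal' ELSE 'not equal' END AS U30C1_comparison; -- 'not equal'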
Michael Kaplan has a blog post where he explains how Unicode strings are compared. It all comes down to the point that a string needs to have a sort weight; if it doesn't, it is considered equal to the empty string.
Sorting it all Out: The jury will give this string no weight
In SQL Server this weight is influenced by the defined collation. Microsoft added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. This post recommends using Chinese_Simplified_Pinyin_100_CI_AS or Chinese_Simplified_Stroke_Order_100_CI_AS:
You can always use any binary and binary2 collations, although they won't give you linguistically correct results. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stroke_90_CI_AS, which support surrogate pair comparison (but not linguistic comparison). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS, which have better linguistic surrogate comparison. I do suggest you use these collations as your server/database/table collation instead of passing the collation name during comparison.
So the following SQL statement would work as expected:
select * from MyTable where N'' = N'㐀' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS;
A list of all supported collations can be found in MSDN:
SQL Server 2008 Books Online: Windows Collation Name
That character U+3422 is from the CJK Unified Ideographs Extension A block, which is a relatively obscure (and politically loaded) part of the Unicode standard. My guess is that SQL Server simply does not know that part - or perhaps even intentionally does not implement it due to political considerations.
Edit: looks like my guess was wrong and the real problem was that neither Latin nor Japanese collation define weights for that character.
If you look at the Unihan data page, the character appears to only have the "K-Source" field which corresponds to the South Korean government's mappings.
My guess is that MS SQL Server asks "is this character a Chinese character?" and, if so, uses the Japanese sorting standard, discarding the character if no sort weight is available for it - likely a SQL Server-specific issue.
I very much doubt it's a political dispute as another poster suggested as the character doesn't even have a Taiwan or Hong Kong encoding mapping.
More technical info: The J-Source (the Japanese sorting order prescribed by the Japanese government) is blank, as the character probably was only used in classical Korean Hanja (Chinese characters that are now used only in some contexts).
The Japanese government's JIS sorting standards generally sort Kanji characters by the Japanese On reading (which is usually the approximated Chinese pronunciation from when the characters were imported into Japan). But this character probably isn't used much in Japanese and may not even have a Japanese pronunciation to associate with it, so it hasn't been added to the data.
According to SQL Server's documentation (and legacy documentation), an nvarchar field without an _SC collation should use the UCS-2 encoding.
Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.
It also states that the UCS-2 encoding stores only the subset of characters supported by UCS-2. From Wikipedia's description of UCS-2:
UCS-2 uses a single code value [...] between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP that represents a character. UCS-2 cannot represent code points outside the BMP.
So, by the specifications above, it seems that I won't be able to store an emoji like 😍, which has the code point 0x1F60D (128525 in decimal, well above the 65,535 limit of UCS-2). But on SQL Server 2008 R2 and SQL Server 2019 (both with the default SQL_Latin1_General_CP1_CI_AS collation), in an nvarchar field, it is stored and returned perfectly (although not supported in comparisons with LIKE or =):
SSMS doesn't render the emoji correctly, but here is the value copied and pasted from the query result: 😍
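A minimal version of the test I ran looks roughly like this (the variable name is only for illustration):
DECLARE @emoji NVARCHAR(10) = N'😍';
SELECT @emoji             AS StoredValue,  -- comes back intact
       LEN(@emoji)        AS CharLength,   -- 2: counted as two UCS-2 code units
       DATALENGTH(@emoji) AS ByteLength;   -- 4 bytes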
So my questions are:
Is the nvarchar field really using UCS-2 on SQL Server 2008 R2 (I also tested on SQL Server 2019, with the same non-_SC collations, and got the same results)?
Is Microsoft's documentation of nchar/nvarchar misleading about "then these data types store only the subset of character data supported by the UCS-2 character encoding"?
Does the UCS-2 encoding support code points beyond 65,535 or not?
How was SQL Server able to correctly store and retrieve this field's data, when it's outside what the UCS-2 encoding supports?
NOTE: Server's Collation is SQL_Latin1_General_CP1_CI_AS and Field's Collation is Latin1_General_CS_AS.
NOTE 2: The original question described tests on SQL Server 2008. I tested and got the same results on SQL Server 2019 with the same respective collations.
NOTE 3: Every other character I tested outside the UCS-2-supported range behaves the same way. Some are: 𝕂, 😂, 𨭎, 𝕬, 𝓰
There are several clarifications to make here regarding the MS documentation snippets posted in the question, the sample code, the questions themselves, and statements made in the comments on the question. Most of the confusion can be cleared up, I believe, by the information provided in the following post of mine:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
First things first (which is the only way it can be, right?): I'm not insulting the people who wrote the MS documentation as SQL Server alone is a huge product and there is a lot to cover, etc, but for the moment (until I get a chance to update it), please read the "official" documentation with a sense of caution. There are several misstatements regarding Collations / Unicode.
UCS-2 is an encoding that handles a subset of the Unicode character set. It works in 2-byte units. With 2 bytes, you can encode values 0 - 65535. This range of code points is known as the BMP (Basic Multilingual Plane). The BMP is all of the characters that are not Supplementary Characters (because those are supplementary to the BMP), but it does contain a set of code points that are exclusively used to encode Supplementary Characters in UTF-16 (i.e. the 2048 surrogate code points). This is a complete subset of UTF-16.
UTF-16 is an encoding that handles all of the Unicode character set. It also works in 2-byte units. In fact, there is no difference between UCS-2 and UTF-16 regarding the BMP code points and characters. The difference is that UTF-16 makes use of those 2048 surrogate code points in the BMP to create surrogate pairs which are the encodings for all Supplementary Characters. While Supplementary Characters are 4-bytes (in UTF-8, UTF-16, and UTF-32), they are really two 2-byte code units when encoding in UTF-16 (likewise, they are four 1-byte units in UTF-8, and one 4-byte in UTF-32).
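As a concrete illustration of the surrogate-pair arithmetic (this is standard UTF-16 math, not anything specific to SQL Server):
DECLARE @cp INT = 128525;                                -- U+1F60D (😍)
SELECT 55296 + ((@cp - 65536) / 1024) AS HighSurrogate,  -- 55357 = 0xD83D
       56320 + ((@cp - 65536) % 1024) AS LowSurrogate;   -- 56845 = 0xDE0D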
Since UTF-16 merely extends what can be done with UCS-2 (by actually defining the usage of the surrogate code points), there is absolutely no difference in the byte sequences that can be stored in either case. All 2048 surrogate code points used to create Supplementary Characters in UTF-16 are valid code points in UCS-2, they just don't have any defined usage (i.e. interpretation) in UCS-2.
NVARCHAR, NCHAR, and the deprecated-so-do-NOT-use-it-NTEXT datatypes all store Unicode characters encoded in UCS-2 / UTF-16. From a storage perspective there is absolutely NO difference. So, it doesn't matter if something (even outside of SQL Server) says that it can store UCS-2. If it can do that, then it can inherently store UTF-16. In fact, while I have not had a chance to update the post linked above, I have been able to store and retrieve, as expected, emojis (most of which are Supplementary Characters) in SQL Server 2000 running on Windows XP. There were no Supplementary Characters defined until 2003, I think, and certainly not in 1999 when SQL Server 2000 was being developed. In fact (again), UCS-2 was only used in Windows / SQL Server because Microsoft pushed ahead with development prior to UTF-16 being finalized and published (and as soon as it was, UCS-2 became obsolete).
The only difference between UCS-2 and UTF-16 is that UTF-16 knows how to interpret surrogate pairs (comprised of a pair of surrogate code points, so at least they're appropriately named). This is where the _SC collations (and, starting in SQL Server 2017, also version _140_ collations which include support for Supplementary Characters so none of them have the _SC in their name) come in: they allow the built-in SQL Server functions to correctly interpret Supplementary Characters. That's it! Those collations have nothing to do with storing and retrieving Supplementary Characters, nor do they even have anything to do with sorting or comparing them (even though the "Collation and Unicode Support" documentation says specifically that this is what those collations do — another item on my "to do" list to fix). For collations that have neither _SC nor _140_ in their name (though the new-as-of-SQL Server 2019 Latin1_General_100_BIN2_UTF8 might be grey-area, at least, I remember there being some inconsistency either there or with the Japanese_*_140_BIN2 collations), the built-in functions only handle BMP code points (i.e. UCS-2).
Not "handling" Supplementary Characters means not interpreting a valid sequence of two surrogate code points as actually being a singular supplementary code point. So, for non-"SC" collations, BMP surrogate code point 1 (B1) and BMP surrogate code point 2 (B2) are just those two code points, neither one of which is defined, hence they appear as two "nothing"s (i.e. B1 followed by B2). This is why it is possible to split a Supplementary Character in two using SUBSTRING / LEFT / RIGHT because they won't know to keep those two BMP code points together. But an "SC" collation will read those code points B1 and B2 from disk or memory and see a single Supplementary code point S. Now it can be handled correctly via SUBSTRING / CHARINDEX / etc.
The NCHAR() function (not the datatype; yes, poorly named function ;) is also sensitive to whether or not the default collation of the current database supports Supplementary Characters. If yes, then passing in a value between 65536 and 1114111 (the Supplementary Character range) will return a non-NULL value. If not, then passing in any value above 65535 will return NULL. (Of course, it would be far better if NCHAR() just always worked, given that storing / retrieving always works, so please vote for this suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation ).
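A one-line illustration of that NCHAR() behavior:
SELECT NCHAR(128525);  -- NULL if the current database's default collation is not Supplementary Character-aware; 😍 if it is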
Fortunately, you don't need an "SC" collation to output a Supplementary Character. You can either paste in the literal character, or convert the UTF-16 Little Endian encoded surrogate pair, or use the NCHAR() function to output the surrogate pair. The following works in SQL Server 2000 (using SSMS 2005) running on Windows XP:
SELECT N'💩', -- 💩
CONVERT(VARBINARY(4), N'💩'), -- 0x3DD8A9DC
CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- 💩 (regardless of DB Collation)
NCHAR(0xD83D) + NCHAR(0xDCA9) -- 💩 (regardless of DB Collation)
For more details on creating Supplementary Characters when using non-"SC" collations, please see my answer to the following DBA.SE question:
How do I set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character?
None of this affects what you see. If you store a code point, then it's there. How it behaves — sorting, comparison, etc — is controlled by collations. But, how it appears is controlled by fonts and the OS. No font can contain all characters, so different fonts contain different sets of characters, with a lot of overlap on the more widely used characters. However, if a font has a particular byte sequence mapped, then it can display that character. This is why the only work required to get Supplementary Characters displaying correctly in SQL Server 2000 (using SSMS 2005) running on Windows XP was to add a font containing the characters and doing one or two minor registry edits (no changes to SQL Server).
Supplementary Characters in SQL_* collations and collations without a version number in their name have no sort weights. Hence, they all equate to each other as well as to any other BMP code points that have no sort weights (including "space" (U+0020) and "null" (U+0000)). They started to fix this in the version _90_ collations.
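For example, under one of those collations the following comparison comes back as equal (a sketch; the collation shown is just one example without Supplementary Character sort weights):
SELECT CASE WHEN N'😍' = N'' COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal (no sort weight)'
            ELSE 'not equal' END AS NoWeightTest;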
SSMS has nothing to do with any of this, outside of possibly needing the font used for the query editor and/or grid results and/or errors + messages changed to one that has the desired characters. (SSMS doesn't render anything outside of maybe spatial data; characters are rendered by the display driver + font definitions + maybe something else).
Therefore, the following statement in the documentation (from the question):
If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.
is both nonsensical and incorrect. They were probably intending to say the datatypes would only store a subset of the UTF-16 encoding (since UCS-2 is the subset). Also, even if it said "UTF-16 character encoding" it would still be wrong because the bytes that you pass in will be stored (assuming enough free space in the column or variable).
I have two fields in a SQL Server table:
When I add some test data with accented characters into the field, it actually stores them! I thought I had to change the column from VARCHAR to NVARCHAR to accept accented characters, etc?
Basically, I thought:
VARCHAR = ASCII
NVARCHAR = Unicode
So is this a case where façade etc are actually ASCII .. while some other characters would error (if VARCHAR)?
I can see the ç and é characters in the extended ASCII chart (link above) .. so does this mean ASCII includes 0->127 or 0->255?
(Side thought: I guess I'm happy with accepting 0->255 and stripping out anything else.)
Edit
DB collation: Latin1_General_CI_AS
Server Version: 12.0.5223.6
Server Collation: SQL_Latin1_General_CP1_CI_AS
First, the details of what SQL Server is doing.
VARCHAR stores single-byte characters using a specific collation. ASCII only uses 7 bits, or half of the possible values in a byte. A collation references a specific code page (along with sorting and equating rules) to use the other half of the possible values in each byte. These code pages often include support for a limited and specific set of accented characters. If the code page used for your data supports an accent character, you can do it; if it doesn't, you see weird results (unprintable "box" or ? characters). You can even output data stored in one collation as if it had been stored in another, and get really weird stuff that way (but don't do this).
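A quick sketch of that behavior, assuming a code page 1252 (Latin1) collation is in effect for the varchar conversion:
DECLARE @v VARCHAR(20)  = N'façade Ωμέγα';
DECLARE @n NVARCHAR(20) = N'façade Ωμέγα';
SELECT @v AS varchar_value,   -- 'façade ?????' : ç and é exist in code page 1252, the Greek letters do not
       @n AS nvarchar_value;  -- 'façade Ωμέγα' : stored as UTF-16, everything survives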
NVARCHAR is Unicode, but there is still some reliance on collations. In most situations you will end up with UTF-16, which does allow for the full range of Unicode characters. Certain collations will instead result in UCS-2, which is slightly more limited. See the nchar/nvarchar documentation for more information.
As an additional quirk, the upcoming SQL Server 2019 will include support for UTF-8 in char and varchar types when using the correct collation.
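A sketch of what that looks like (the exact collation name is an assumption; check sys.fn_helpcollations() for the _UTF8 collations available on your build):
CREATE TABLE dbo.Utf8Demo
(
    Notes VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8  -- varchar column stored as UTF-8
);
INSERT INTO dbo.Utf8Demo (Notes) VALUES (N'façade Ωμέγα');
SELECT Notes, DATALENGTH(Notes) AS Bytes FROM dbo.Utf8Demo;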
Now to answer the question.
In some rare cases, where you are sure your data only needs to support accent characters originating from a single specific (usually local) culture, and only those specific accent characters, you can get by with the varchar type.
But be very careful making this determination. In an increasingly global and diverse world, where even small businesses want to take advantage of the internet to increase their reach, even within their own community, using an insufficient encoding can easily result in bugs and even security vulnerabilities. The majority of situations where it seems like a varchar encoding might be good enough are really not safe anymore.
Personally, about the only place I use varchar today is mnemonic code strings that are never shown to or provided by an end user; things that might be enum values in procedural code. Even then, this tends to be legacy code, and given the option I'll use integer values instead, for faster joins and more efficient memory use. However, the upcoming UTF-8 support may change this.
VARCHAR is ASCII using the current system code page - so the set of characters you can save depends on which code page is in use.
NVARCHAR is Unicode, so you can store all the characters.
Or if it does not, then what actually is a SQL Server collation? Maybe my understanding of collation (as a concept) is wrong.
I do not wish to specify my collation to greek or icelandic or even western-european. I wish to be able to use any language that is supported in Unicode.
(I'm using MSSQL 2005)
UPDATE: Ok, I'm rephrasing the question: Is there a generic, culture-independent collation that can be used for texts of any culture? I know it will not contain culture-specific rules like 'ty' in Hungarian or ß=ss in German, but will provide consistent, mostly acceptable results.
Is there any collation that is not culture-specific?
Well, there's always a binary collation like Latin1_General_BIN2. It sorts by code point in numerical order, which can be pretty arbitrary. It's not culture-specific, though (despite the name).
It sounds like there isn't any intelligent way to sort data from multiple languages/cultures together so instead of a half-baked solution, all you can do is sort by the binary values.
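In practice that just means forcing a binary collation in the ORDER BY (or setting it on the column); a sketch with a hypothetical table:
SELECT Name
FROM dbo.People                              -- hypothetical table
ORDER BY Name COLLATE Latin1_General_BIN2;   -- culture-independent, code-point order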
This is a good article for learning what collation is, short and sweet: SQL Server and Collation.
Collation is something that allows you to compare and sort data. As far as I can remember, there is no such thing as a Unicode collation.
There is a default Unicode collation, the "Default Unicode Collation Element Table (DUCET)", described in the Unicode Collation Algorithm technical standard: http://www.unicode.org/reports/tr10/.
But one calls it the default Unicode collation rather than the Unicode collation because of course there is more than one -- for example, the unicode.org chart for Hungarian (http://www.unicode.org/cldr/charts/28/collation/hu.html) describes how Hungarian collation for Unicode characters differs from the DUCET.
Since this question was asked, SQL Server collations have become more Unicode-aware: https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017. Meanwhile, some open-source DBMSs have gained the ability to support DUCET and other Unicode collations by incorporating the ICU (International Components for Unicode) library.
I've spent a lot of time this evening trying to find guidance about which choice of collation to apply in my SQL Server 2008 R2 installation, but almost everything online basically says "choose what is right for you." Extremely unhelpful.
My context is new application development. I am not worrying about backward compatibility with a prior version of SQL Server (viz. <= 2005). I am very interested in storing data representing languages from around the globe - not just Latin based. What very little help I've found online suggests I should avoid all "SQL_" collations. This narrows my choice to using either a binary or "not binary" collation based on the Windows locale.
If I use binary, I gather I should use "BIN2." So this is my question. How do I determine whether I should use BIN2 or just "Latin1_General_100_XX_XX_XX"? My spider-sense tells me that BIN2 will provide collation that is "less accurate," but more generic for all languages (and fast!). I also suspect the binary collation is case sensitive, accent sensitive, and kana-sensitive (yes?). In contrast, I suspect the non-binary collation would work best for Latin-based languages.
The documentation doesn't support my claims above, I'm making educated guesses. But this is the problem! Why is the online documentation so thin that the choice is left to guesswork? Even the book "SQL Server 2008 Internals" discussed the variety of choices, without explaining why and when binary collation would be chosen (compared with non-binary windows collation). Criminy!!!
"SQL Server 2008 Internals" has a good discussion on the topic imho.
Binary collation is tricky: if you intend to support text search for human beings, you'd better go with non-binary. Binary is good for gaining a tiny bit of performance if you have tuned everything else (architecture first), and in cases where case sensitivity and accent sensitivity are desired behavior, like password hashes for instance. Binary collation is actually "more precise" in the sense that it does not treat similar texts as equal. The sort orders you get out of it are good for machines only, though.
There is only a slight difference between the SQL_* collations and the native windows ones. If you're not constrained with compatibility, go for the native ones as they are the way forward afaik.
Collation decides sort order and equality. You choose what best suits your users. It's understood that you will use the Unicode types (like nvarchar) for your data to support international text; collation also affects what can be stored in a non-Unicode column, which in that case does not affect you.
What really matters is that you avoid mixing collations in WHERE clauses, because that's where you pay the price by not using indexes. Afaik there's no silver-bullet collation that supports all languages. You can either choose one for the majority of your users or go for full localization support with a different column for each language.
One important thing is to have the server collation the same as your database collation. It will make your life much easier if you plan to use temporary tables: temporary tables created with "CREATE TABLE #ttt..." pick up the server collation, so with a mismatched database collation you'd run into collation conflicts that you'd need to solve by specifying an explicit collation. This has a performance impact too.
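A sketch of that conflict (table and column names are made up):
CREATE TABLE #ttt (Name NVARCHAR(50));             -- the column silently picks up the server/tempdb collation
SELECT p.Name
FROM dbo.People AS p                               -- hypothetical table in a database with a different collation
JOIN #ttt AS t
  ON t.Name = p.Name COLLATE DATABASE_DEFAULT;     -- without the explicit COLLATE this can fail with a collation-conflict error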
Please do not consider my answer as complete, but you should take into consideration the following points:
(As said by @Anthony) All text fields must use the nvarchar data type. This will allow you to store any character from any language in the Unicode character set. If you do not do so, you will not be able to mix text from different origins (Latin, Cyrillic, Arabic, etc.) in your tables.
This said, your collation choice will mainly affect the following:
The collating sequence, or sorting rules to be set between characters such as 'e' and 'é', or 'c' and 'ç' (should they be considered equal or not?). In some cases, collating sequences also consider specific letter combinations, as in Hungarian, where C and CS, or D, DZ and DZS, are treated independently.
The way spaces (or other non-letter characters) are analysed: which one is the correct 'alphabetical' order?
this one (spaces are considered as 'first rank' characters)?
San Juan
San Teodoro
Santa Barbara
or this one (spaces are not considered in the ordering)?
San Juan
Santa Barbara
San Teodoro
Collation also impacts case sensitivity: do capital letters have to be treated as equivalent to small letters?
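For example, the same comparison flips depending on whether a case-insensitive or case-sensitive collation is applied:
SELECT CASE WHEN N'SQL' = N'sql' COLLATE Latin1_General_CI_AS THEN 'equal' ELSE 'different' END AS case_insensitive, -- 'equal'
       CASE WHEN N'SQL' = N'sql' COLLATE Latin1_General_CS_AS THEN 'equal' ELSE 'different' END AS case_sensitive;   -- 'different'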
The best default collation for a global database (e.g. a website) is probably Latin1_General_CI_AS. More important than collation is making sure that all textual columns use the nvarchar data type.
As long as you use NVARCHAR columns (as you should for mixed international data), all *_BIN and *_BIN2 collations perform the same binary comparison/sorting based on the Unicode code points. It doesn't matter which one you pick. Latin1_General_BIN2 looks like a reasonable generic choice.
Source: http://msdn.microsoft.com/en-us/library/ms143350(v=sql.105).aspx
Does SQL Server's (2000) Soundex function work on Asian character sets? I used it in a query and it appears to have not worked properly but I realize that it could be because I don't know how to read Chinese...
Furthermore, are there any other languages that the function might have trouble working with? (Russian, for example)
Thank you, Frank
Soundex is fairly specific to English - it may or may not work well on other languages. One example that happened in New Zealand was an attempt at patient name matching using Soundex. Unfortunately, Pacific Island names did not work well with Soundex, in many cases hashing to the same small set of values. A different algorithm had to be used.
Your mileage may vary. On more recent versions of SQL Server you could write a CLR function to do some other computation.
By design it works best on English words using the ASCII character set. I have used it on a project in Romania where I replaced the Romanian special characters with corresponding ASCII characters that sound more or less the same. It is not perfect, but in my case it was a lot better than nothing.
I think you will have no great success with applying SOUNDEX on Asian character sets.
I know that Soundex in older versions of SQL Server ignored any non-English characters. I believe it didn't even handle Latin-1, let alone anything more exotic.
I never dealt with Soundex much in SQL 2000; all I know for certain is that it does not handle Arabic correctly. This likely extends to other non-Latin character sets as well.
In any case, a Soundex-based algorithm is unlikely to yield acceptable results for non-English languages, even aside from character-set issues. Soundex was specifically designed to handle the English pronunciation of names (mostly those of Western European origin) and does not function particularly well outside of that use. You would often be better off researching one of the several variants of Soundex, or other unrelated phonetic-similarity algorithms, designed to address the language(s) in question.
You may use an algorithm like Levenshtein distance. There are various implementations of the algorithm as user-defined functions which you may use within a SELECT statement.
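For example, with a hypothetical user-defined function dbo.Levenshtein(@a, @b) returning the edit distance, a fuzzy name lookup might look like this:
DECLARE @target NVARCHAR(50) = N'Jonson';
SELECT Name
FROM dbo.People                               -- hypothetical table
WHERE dbo.Levenshtein(Name, @target) <= 2;    -- keep names within an edit distance of 2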