I've spent a lot of time this evening trying to find guidance about which choice of collation to apply in my SQL Server 2008 R2 installation, but almost everything online basically says "choose what is right for you." Extremely unhelpful.
My context is new application development. I am not worried about backward compatibility with a prior version of SQL Server (i.e. <= 2005). I am very interested in storing data representing languages from around the globe, not just Latin-based ones. What very little help I've found online suggests I should avoid all "SQL_" collations. This narrows my choice to either a binary or a non-binary collation based on the Windows locale.
If I use binary, I gather I should use "BIN2." So this is my question. How do I determine whether I should use BIN2 or just "Latin1_General_100_XX_XX_XX"? My spider-sense tells me that BIN2 will provide collation that is "less accurate," but more generic for all languages (and fast!). I also suspect the binary collation is case sensitive, accent sensitive, and kana-sensitive (yes?). In contrast, I suspect the non-binary collation would work best for Latin-based languages.
The documentation doesn't support my claims above; I'm making educated guesses. But this is the problem! Why is the online documentation so thin that the choice is left to guesswork? Even the book "SQL Server 2008 Internals" discusses the variety of choices without explaining why and when a binary collation would be chosen (compared with a non-binary Windows collation). Criminy!!!
"SQL Server 2008 Internals" has a good discussion on the topic imho.
Binary collation is tricky: if you intend to support text search for human beings, you'd better go with non-binary. Binary is good for gaining a tiny bit of performance once you have tuned everything else (architecture first), and in cases where case sensitivity and accent sensitivity are desired behavior, like password hashes for instance. A binary collation is actually "more precise" in the sense that it does not consider similar texts equal; the sort orders it produces are good for machines only, though.
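A minimal sketch of that difference, comparing the same pair of strings under an accent-insensitive Windows collation and a binary one (the strings are arbitrary):

SELECT CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AI
            THEN 'equal' ELSE 'different' END AS WindowsCollation,  -- 'equal'
       CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_BIN2
            THEN 'equal' ELSE 'different' END AS BinaryCollation;   -- 'different'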
There is only a slight difference between the SQL_* collations and the native Windows ones. If you're not constrained by compatibility, go for the native ones, as they are the way forward afaik.
Collation decides sort order and equality; choose what best suits your users. It's understood that you will use the Unicode types (like nvarchar) for your data to support international text. Collation also affects what can be stored in a non-Unicode column, but that won't affect you then.
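A quick sketch of that last point, using a throwaway temp table:

-- The code page tied to a collation limits what a varchar column can
-- store; an nvarchar column is unaffected.
CREATE TABLE #demo (
    v varchar(20) COLLATE Latin1_General_CI_AS,  -- code page 1252
    n nvarchar(20)
);
INSERT INTO #demo VALUES (N'Кирилица', N'Кирилица');
SELECT v, n FROM #demo;  -- v comes back as '????????', n round-trips intact
DROP TABLE #demo;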
What really matters is that you avoid mixing collations in WHERE clauses, because a collation conflict there is where you pay the fine of not using indexes. Afaik there's no silver-bullet collation that supports all languages. You can either choose one for the majority of your users or go for localization support with a different column per language.
One important thing is to keep the server collation the same as your database collation. It will make your life much easier if you plan to use temporary tables: a temporary table created with "CREATE TABLE #ttt..." picks up the server collation, so you'd run into collation conflicts which you'd need to solve by specifying an explicit collation. This has a performance impact too.
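A hypothetical illustration of that conflict (dbo.Customers is a made-up table in a database whose collation differs from the server's):

CREATE TABLE #t (Name nvarchar(50));  -- column picks up the server collation

SELECT c.Name
FROM dbo.Customers AS c
JOIN #t AS t ON c.Name = t.Name;      -- fails: "Cannot resolve the collation conflict..."

-- One common fix: force a collation on one side of the comparison.
SELECT c.Name
FROM dbo.Customers AS c
JOIN #t AS t ON c.Name = t.Name COLLATE DATABASE_DEFAULT;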
Please do not consider my answer as complete, but you should take into consideration the following points:
(As said by @Anthony) All text fields must use the nvarchar data type. This will allow you to store any character from any language defined in the Unicode character set. If you do not do so, you will not be able to mix text from different origins (Latin, Cyrillic, Arabic, etc.) in your tables.
This said, your collation choice will mainly affect the following:
The collating sequence, or sorting rules, between characters such as 'e' and 'é', or 'c' and 'ç' (should they be considered equal or not?). In some cases, collating sequences even consider specific letter combinations, as in Hungarian, where C and CS, or D, DZ and DZS, are treated independently.
The way spaces (or other non-letter characters) are analysed: which one is the correct 'alphabetical' order?
this one (spaces are considered as 'first rank' characters)?
San Juan
San Teodoro
Santa Barbara
or this one (spaces are not considered in the ordering)?
San Juan
Santa Barbara
San Teodoro
Collation also impacts case sensitivity: do capital letters have to be considered equivalent to small letters? (See the sketch after this list.)
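A minimal sketch of the case-sensitivity point, using a throwaway temp table:

CREATE TABLE #Names (Name nvarchar(20));
INSERT INTO #Names VALUES (N'abc'), (N'ABC');

SELECT COUNT(*) FROM #Names
WHERE Name = N'abc' COLLATE Latin1_General_CI_AS;  -- 2: case-insensitive
SELECT COUNT(*) FROM #Names
WHERE Name = N'abc' COLLATE Latin1_General_CS_AS;  -- 1: case-sensitive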
The best default collation for a global database (e.g. a website) is probably Latin1_General_CI_AS. More important than collation is making sure that all textual columns use the nvarchar data type.
As long as you use NVARCHAR columns (as you should for mixed international data), all *_BIN2 collations perform the same comparison and sorting based on Unicode code points, so it doesn't matter much which one you pick. (The older *_BIN collations compare only the first character by code point and the rest byte by byte, so prefer BIN2.) Latin1_General_BIN2 looks like a reasonable generic choice.
Source: http://msdn.microsoft.com/en-us/library/ms143350(v=sql.105).aspx
Related
I saw one or two threads talking generally about case sensitivity, but my question is more specific.
I understand the value of case-insensitive searches on text values, for example.
But why would we use case-insensitive database names, tables and columns?
Isn't it going to lead to mistakes? Scripting languages that use databases are all case-sensitive, so for example if we don't use the right case for a field, it won't be found...
The SQL:2008 and SQL-99 standards define databases to be case-insensitive for identifiers unless they are quoted. I've found most ORMs will quote identifiers in the SQL they generate.
However, as you probably know not all relational databases strictly adhere to the standards. DB2 and Oracle are 100% compliant. PostgreSQL is mostly compliant, except for the fact that it automatically lowercases anything that isn't quoted (which personally I prefer.)
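For example, in PostgreSQL (a sketch; the table name is arbitrary):

CREATE TABLE Accounts (id INTEGER);  -- unquoted: folded to "accounts"
SELECT * FROM ACCOUNTS;              -- also folded to "accounts", works
SELECT * FROM "Accounts";            -- quoted: a different identifier, fails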
MySQL gets a bit weird, since it stores each table as a file on the file system. For this reason, it's subject to the case sensitivity of the file system. On Windows:
CREATE TABLE FOO (a INTEGER);
CREATE TABLE `Foo` (a INTEGER); -- Errors out, already exists
Whereas on Linux:
CREATE TABLE FOO (a INTEGER); -- Creates FOO table
CREATE TABLE `Foo` (a INTEGER); -- Creates Foo table
SQL Server is even stranger. It preserves the case at creation time, but lets you refer to the table any way you like afterwards (even if you quote the name!). You can't create two tables whose only difference is their casing. Note: SQL Server does have configuration for this, as case sensitivity of identifiers depends on the collation of the database/instance. How confusing!
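A sketch of that last point (the database name is made up):

-- Identifier case sensitivity follows the database collation.
CREATE DATABASE CsDemo COLLATE Latin1_General_CS_AS;  -- case-sensitive catalog
GO
USE CsDemo;
CREATE TABLE Foo (a INT);
CREATE TABLE FOO (a INT);  -- succeeds here; in a CI-collated database it would collide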
While for the most part I agree with you that computers (programming languages, databases, file systems, URLs, passwords, etc.) should be case-sensitive, all systems are implemented independently and may or may not adhere to standards that may or may not exist. Implementing a case-sensitive database is definitely possible if you know the ins and outs of your particular database system and how it behaves.
It's really your responsibility to implement your system in a way that works for you, and not the entire technology industry to implement everything in a consistent way to make your life easier.
The main advantage of developing with case sensitivity is that when we deploy on a client site, our DB works regardless of whether the client's SQL Server is set up case sensitive or not. So yes, relying on case insensitivity really isn't a good idea, and I don't know why anyone would use case-insensitive database tables/columns.
If you could redo the whole IT industry today, with current knowledge and technology, you might default to making everything case sensitive, with the only exception being things explicitly asked to be case insensitive.
But back in the days before I was born, and even when I started working (OK, playing) with computers, many computers couldn't even differentiate between upper- and lower-case letters. I built a mighty complicated card to plug into my fake Apple II to make it understand the difference.
So I guess in those days having something like a difference between upper and lower case was like having a Retina display nowadays: it's cool if you have it. And in 10 years we might ask why anybody ever created an application without such displays in mind, but today it just isn't that relevant.
The same is true for databases (and file systems), since many of them and their respective standards go back to the 70s at least.
Is there any collation that is not culture-specific? Or if there is not, then what actually is a SQL Server collation? Maybe my understanding of collation (as a concept) is wrong.
I do not wish to set my collation to Greek or Icelandic or even Western European. I wish to be able to use any language that is supported in Unicode.
(I'm using MSSQL 2005.)
UPDATE: OK, I'm rephrasing the question: is there a generic, culture-independent collation that can be used for text of any culture? I know it will not contain culture-specific rules like 'ty' in Hungarian or ß = ss in German, but it should provide consistent, mostly acceptable results.
Well, there's always a binary collation like Latin1_General_BIN2. It sorts by Unicode code point, which can look pretty arbitrary, but it's not culture-specific (despite the name).
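To see how arbitrary code-point order can look (a tiny sketch; the VALUES row constructor needs SQL Server 2008+):

SELECT v
FROM (VALUES (N'Zebra'), (N'apple')) AS t(v)
ORDER BY v COLLATE Latin1_General_BIN2;
-- returns Zebra before apple: 'Z' (U+005A) sorts below 'a' (U+0061)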
It sounds like there isn't any intelligent way to sort data from multiple languages/cultures together, so instead of a half-baked solution, all you can do is sort by the binary values.
This is a good article for understanding what collation is, short and sweet: SQL Server and Collation.
Collation is what allows you to compare and sort data. As far as I can remember, there is no such thing as a 'Unicode collation'.
There is a default Unicode collation, the "Default Unicode Collation Element Table (DUCET)", described in the Unicode Collation Algorithm technical standard: http://www.unicode.org/reports/tr10/. But one calls it the default Unicode collation rather than the Unicode collation because of course there is more than one: for example, the unicode.org chart for Hungarian (http://www.unicode.org/cldr/charts/28/collation/hu.html) describes how Hungarian collation of Unicode characters differs from the DUCET.
Since this question was asked, SQL Server collations have become more Unicode-aware: https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017. Meanwhile some open-source DBMSs have gained the ability to support DUCET and other Unicode collations by incorporating the ICU (International Components for Unicode) library.
For a specific column in a database running on SQL Server Express 2012 I need a collation where ss and ß are not considered the same when comparing strings. Likewise, ä and ae, ö and oe, and ü and ue should each be considered different. Latin1_General_CI_AS provides the latter, but ss and ß are not distinguished: WHERE ThatColumn = 'Fass' yields both Fass and Faß.
I would simply stick to BIN/BIN2, but I need case insensitivity. If nothing else works I'll have to use Latin1_General_BIN/Latin1_General_BIN2 and make sure everything is uppercase or lowercase myself. It would mean more work as I need to be able to retrieve the version with proper casing as well.
But if there's a collation that does what I need, please let me know. Thanks in advance!
Update:
More information about the requirements: the database contains personal names from a legacy system which only supported ASCII characters. That is, names like Müller and Faß are stored as Mueller and Fass. In the new system the user will have a function to rename those persons, e.g. rename "Mueller" to "Müller". To find the entities that need renaming I need to search for rows containing e.g. "Fass". But as it is now the query also returns "Faß", which is not what I want. I still need/want case insensitivity as the user should be able to search for "fass" and still get "Fass".
There's more to the system, but I can definitively say that I need to distinguish between ss and ß, ä and ae etc.
The collation SQL_Latin1_General_CP1_CI_AS considers 'ss' to be different from 'ß', so it may work for you. It's a legacy collation, so it might create incompatibilities with operating systems and application platforms. It might also have other quirks I don't know of.
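A quick way to check that suggestion (a sketch; note that SQL_ collations apply their own rules only to non-Unicode varchar data, so test against your actual column type):

SELECT COUNT(*)                       -- expect 1: only 'Fass' matches
FROM (VALUES ('Fass'), ('Faß')) AS p(Name)
WHERE Name = 'Fass' COLLATE SQL_Latin1_General_CP1_CI_AS;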
A few years ago, I sketched out a kludgy workaround for a similar problem, and you can take a look at it here. (Click on the "Workarounds (1)" tab.) The problem in that case was regarding uniqueness in a key column, not string comparison results, so to apply it in your situation (and compare two columns to do a single string comparison) might be unfeasible.
We are testing our application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.
On both Latin and Japanese-collated systems the following equality is true (U+3422):
N'㐢㐢㐢㐢' = N'㐢㐢㐢'
but the following is not (U+30C1):
N'チチチチ' = N'チチチ'
This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don't know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?
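The examples above can be reproduced directly as a query (a sketch; the results depend on the database's default collation):

SELECT CASE WHEN N'㐢㐢㐢㐢' = N'㐢㐢㐢' THEN 'equal' ELSE 'different' END AS U3422,
       CASE WHEN N'チチチチ' = N'チチチ' THEN 'equal' ELSE 'different' END AS U30C1;
-- on a Latin1- or Japanese-collated instance: U3422 = 'equal', U30C1 = 'different'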
Michael Kaplan has a blog post where he explains how Unicode strings are compared. It all comes down to the point that a string needs to have a weight; if it doesn't, it will be considered equal to the empty string.
Sorting it all Out: The jury will give this string no weight
In SQL Server this weight is influenced by the defined collation. Microsoft has added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. This post recommends to use Chinese_Simplified_Pinyin_100_CI_AS or Chinese_Simplified_Stroke_Order_100_CI_AS:
You can always use any of the binary and BIN2 collations, although they won't give you linguistically correct results. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stroke_90_CI_AS, which support surrogate-pair comparison (but not linguistically). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS or Chinese_Simplified_Stroke_Order_100_CI_AS, which have better linguistic surrogate comparison. I do suggest you use these collations as your server/database/table collation instead of passing the collation name during comparison.
So the following SQL statement would work as expected:
select * from MyTable where N'' = N'㐀' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS; -- returns no rows: under this collation U+3400 has a weight, so it no longer equals the empty string
A list of all supported collations can be found in MSDN:
SQL Server 2008 Books Online: Windows Collation Name
That character, U+3422, is from the CJK Unified Ideographs tables, which are a relatively obscure (and politically loaded) part of the Unicode standard. My guess is that SQL Server simply does not know that part, or perhaps even intentionally does not implement it due to political considerations.
Edit: looks like my guess was wrong, and the real problem is that neither the Latin nor the Japanese collation defines weights for that character.
If you look at the Unihan data page, the character appears to only have the "K-Source" field which corresponds to the South Korean government's mappings.
My guess is that MS SQL asks "is this character a Chinese character?" If so, it uses the Japanese sorting standard, discarding the character if no collation weight is available - likely a SQL Server-specific issue.
I very much doubt it's a political dispute as another poster suggested as the character doesn't even have a Taiwan or Hong Kong encoding mapping.
More technical info: the J-Source (the Japanese sort mapping prescribed by the Japanese government) is blank, as the character probably was only used in classical Korean Hanja (Chinese characters, which are now only used in some contexts).
The Japanese government's JIS sorting standards generally sort Kanji characters by the Japanese On reading (which is usually the approximated Chinese pronunciation when the characters were imported into Japan.) But this character probably isn't used much in Japanese and may not even have a Japanese pronunciation to associate with it, so hasn't been added to the data.
We have just 'migrated' an SQL Server 2005 database from DEVEL into TEST. Somehow during the migration process the DB was changed from case insensitive to sensitive - so most SQL queries broke spectacularly.
What I would like to know, is - are there any clear benefits to having a case sensitive schema?
NOTE: By this I mean table names, column names, stored proc names etc. I am NOT referring to the actually data being stored in the tables.
At first inspection, I cannot find a valid reason that offers benefits over case insensitivity.
I just found out why WE make it case sensitive. It is to ensure that when we deploy it on the client site, our DB works regardless whether the client's SQL Server is set up case sensitive or not.
That is one answer I wasn't expecting.
I really can't think of any good reason SQL identifiers should be case sensitive. I can think of one bad one: it's the one MySQL gives for why their table names are case sensitive. Each table is a file on disk, your filesystem is case-sensitive, and the MySQL devs forgot to do table_file = lc(table_name). This is heaps of fun when you move a MySQL schema to a case-insensitive filesystem.
I can think of one big reason why they shouldn't be case sensitive.
Some schema author is going to be clever and decide that this_table obviously means something different from This_Table and make those two tables (or columns). You might as well write "insert bugs here" at that point in the schema.
Also, case-insensitivity lets you be more expressive in your SQL to emphasize tables and columns vs commands without being held to what the schema author decided to do.
SELECT this, that FROM Table;
Not all sections of Unicode have a bijective mapping between upper and lower-case characters — or even two sets of cases.
In those regions, "case-insensitivity" is a little meaningless, and probably misleading.
That's about all I can think of for now; in the ASCII set, unless you want Foo and foo to be different, I don't see the point.
Most languages out there are case-sensitive, as are most comparison algorithms, most file systems, etc. Case insensitivity is for lazy users: although it does tend to make things easier to type, it also leads to many variants of the same name differing only by case.
Personally, between (MyTable, mytable, myTable, MYTABLE, MYTable, myTABLE, MyTaBlE), I would please like to see one universal version.
Case insensitivity is a godsend when you have developers that fail to follow any sort of conventions when writing SQL or come from development languages where case insensitivity is the norm such as VB.
Generally speaking I find it easier to deal with databases where there is no possibility that ID, id, and Id are distinct fields.
Other than a personal preference for torture, I would strongly recommend you stay with case insensitivity.
The only database I ever worked on that was set up for case sensitivity was Great Plains. I found having to remember every single casing of their schema namings was painful. I have not had the privilege of working with more recent versions.
Unless it has changed, and if my memory serves, the kind of case sensitivity you are speaking of is determined at installation time and applied to all databases. That was the case with the SQL Server installation that ran the Great Plains database I mentioned: all databases on that installation were case sensitive.
I like case-sensitivity, mostly because that's what I'm used to from programming in Perl (and most any other language too). I like using StudlyCaps for table names and all lower case with underscores for columns.
Of course, many databases allow you to quote names to enforce casing, like Postgres does. That seems like a reasonable approach as well.
I do support for Sybase Advantage Database Server and it uses a flat file format allowing DBF's as well as our own proprietary ADT format. The case where I see case sensitivity being an issue is when using our Linux version of the server. Linux is a case sensitive OS so we have an option in our db to lowercase all calls. This requires that the table files be lower case.
I'm pretty sure the SQL spec requires case folding (which is effectively the same as insensitivity) for unquoted identifiers. PostgreSQL folds to lower, Oracle folds to upper.