Microsoft SQL Server collation names - sql-server

Does anybody know what the WS property of a collation does? Does it have anything to do with Asian scripts? The MSDN docs explain it as "Width Sensitive", but that doesn't make much sense for, say, Swedish or English...?

A good description of width sensitivity is summarized here: http://www.databasejournal.com/features/mssql/article.php/3302341/SQL-Server-and-Collation.htm
Width sensitivity
When a single-byte character (half-width) and the same character when represented as a double-byte character (full-width) are treated differently, then it is width sensitive.
Perhaps from an English character perspective, I would theorize that a width-sensitive collation would mean that 'abc' <> N'abc', because one string is a Unicode string (2 bytes per character), whereas the other uses one byte per character.
From a Latin character set perspective it seems like something that wouldn't make sense to set. Perhaps in other languages this is important.
I try to set these types of collation properties to insensitive in general in order to avoid weird things like records not getting returned in search results. I usually keep accents set to insensitive, since that can cause a lot of user search headaches, depending on the audience of your applications.
Edit:
After creating a test database with the Latin1_General_CS_AS_WS collation, I found that N'a' = 'a' is actually true. Test queries were:
select case when 'a' = 'A' then 'yes' else 'no' end
select case when 'a' = 'a' then 'yes' else 'no' end
select case when N'a' = 'a' then 'yes' else 'no' end
So in practice I'm not sure where this type of rule comes into play.

The accepted answer demonstrates that it does not come into play for the comparison N'a' = 'a'. This is easily explained because the char will get implicitly converted to nchar in the comparison anyway, so both strings being compared are Unicode.
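One way to convince yourself that both sides really are Unicode by the time the comparison happens is to look at the bytes (a minimal sketch; the column aliases are just mine):
SELECT CAST(N'a' AS VARBINARY(4)) AS NVarcharBytes,                       -- 0x6100
       CAST(CAST('a' AS NVARCHAR(1)) AS VARBINARY(4)) AS ConvertedBytes;  -- 0x6100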
I just thought of an example of a place where width sensitivity might be expected to come into play in a Latin Collation only to discover that it appeared to make no difference at all there either...
DECLARE @T TABLE (
    a VARCHAR(2) COLLATE Latin1_General_100_CS_AS_WS,
    b VARCHAR(2) COLLATE Latin1_General_100_CS_AS_WS)

INSERT INTO @T
VALUES (N'Æ',
        N'AE');

SELECT LEN(a) AS [LEN(a)],
       LEN(b) AS [LEN(b)],
       a,
       b,
       CASE
         WHEN a = b THEN 'Y'
         ELSE 'N'
       END AS [a=b]
FROM @T
LEN(a)      LEN(b)      a    b    a=b
----------- ----------- ---- ---- ----
1           2           Æ    AE   Y
The Book "Microsoft SQL Server 2008 Internals" has this to say.
Width Sensitivity refers to East Asian
languages for which there exists both
half-width and full-width forms of
some characters.
There is absolutely nothing stopping you from storing these characters in a collation such as Latin1_General_100_CS_AS_WS as long as the column has a Unicode data type, so I guess that the WS part would only apply in that particular situation.
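If you want to see the WS flag actually change a result, you can compare a half-width Latin letter with its full-width counterpart (NCHAR(65313) is U+FF21, FULLWIDTH LATIN CAPITAL LETTER A). This is only a sketch, but I would expect the width-insensitive collation to treat the pair as equal and the width-sensitive one not to:
SELECT CASE WHEN N'A' = NCHAR(65313) COLLATE Latin1_General_100_CS_AS
            THEN 'equal' ELSE 'not equal' END AS [Width insensitive],   -- 'equal'
       CASE WHEN N'A' = NCHAR(65313) COLLATE Latin1_General_100_CS_AS_WS
            THEN 'equal' ELSE 'not equal' END AS [Width sensitive];     -- 'not equal'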

Related

Comparing the same character in VARCHAR and NVARCHAR differs between CP1/CP1252 vs. CP850 based on DB collation

Here are my two variables:
DECLARE @First VARCHAR(254) = '5’-Phosphate Analogs Freedom to Operate'
DECLARE @Second NVARCHAR(254) = CONVERT(NVARCHAR(254), @First)
I have two databases, let's call them "Database1" and "Database2". Database1 has a default collation of SQL_Latin1_General_CP850_CI_AS; Database2 is SQL_Latin1_General_CP1_CI_AS. Both databases have a compatibility level of SQL Server 2008 (100).
I first connect to Database1 and run the following queries:
SELECT CASE
         WHEN @First COLLATE SQL_Latin1_General_CP1_CI_AS
              = @Second COLLATE SQL_Latin1_General_CP1_CI_AS
         THEN 'Equal' ELSE 'Not Equal' END

SELECT CASE
         WHEN @First COLLATE SQL_Latin1_General_CP850_CI_AS
              = @Second COLLATE SQL_Latin1_General_CP850_CI_AS
         THEN 'Equal' ELSE 'Not Equal' END
The results are:
Equal
Equal
Then I connect to Database2 and run the queries; the results are:
Equal
Not Equal
Note that I have not changed the queries themselves, just the db connection, and I'm specifying the collations to be used rather than allowing them to use the databases' default collations. Therefore, it's my understanding that the database default collation should not matter, i.e. the results of the queries should be the same regardless of which database I'm connected to.
I have three questions:
Why do I get different results when the only thing I change is the database to which I'm connected, given that I've effectively ignored the default database collation by explicitly specifying my own?
For the test against Database2, why does the comparison succeed with the SQL_Latin1_General_CP1_CI_AS collation and fail with the SQL_Latin1_General_CP850_CI_AS collation? What is the difference between the two collations that accounts for this?
Most Perplexing: If the default collation of the database to which I'm connected does matter, as it would seem, and the default collation of Database1 is SQL_Latin1_General_CP850_CI_AS (which, remember, resulted in Equal, Equal in my first test), why does the second query, which explicitly specifies that very same collation, fail (Not Equal) when connected to Database2?
Simply because this is how non-Unicode data works. Non-Unicode data (i.e. 8-bit Extended ASCII) uses the same characters for the first 128 values, but different characters for the second set of 128 characters, based on the Code Page. The character you are testing — ’ — exists in Code Page 1252 but not in Code Page 850.
Yes, the default Collation of the "current" database absolutely matters for string literals and local variables. When you are in a database with a default Collation that uses Code Page 850, a non-Unicode string literal (i.e. a string that is not prefixed with N) is automatically converted to an equivalent character that does exist in Code Page 850. BUT, that character does indeed exist in Code Page 1252, so there is no need for it to be converted there.
So why is the comparison between the non-Unicode string and the Unicode string "not equal" when in a database using a Collation associated with Code Page 1252? Because when converting the non-Unicode string into Unicode, another conversion takes place that translates the character into its true Unicode value, which is above decimal value 255.
Run the following in both databases and you will see what happens:
SELECT ASCII('’') AS [AsciiValue], UNICODE('’') AS [CodePoint];
SELECT ASCII('’' COLLATE SQL_Latin1_General_CP1_CI_AS) AS [AsciiValue],
UNICODE('’' COLLATE SQL_Latin1_General_CP1_CI_AS) AS [CodePoint];
SELECT ASCII('’' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [AsciiValue],
UNICODE('’' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [CodePoint];
Results when the "current" database uses a Collation associated with Code Page 850 (all 3 queries return the same thing):
AsciiValue CodePoint
39 39
As you can see from the above, specifying COLLATE on a string literal is after the fact of how that string has already been interpreted with respect to the default Collation of the "current" database.
Results when the "current" database uses a Collation associated with Code Page 1252:
-- no COLLATE clause
AsciiValue CodePoint
146 8217
-- COLLATE SQL_Latin1_General_CP1_CI_AS
AsciiValue CodePoint
146 8217
-- COLLATE SQL_Latin1_General_CP850_CI_AS
AsciiValue CodePoint
39 39
But why the conversion from 146 to 8217 if the character is available in Code Page 1252? Because the first 256 characters in Unicode are not Code Page 1252, but instead are ISO-8859-1. These two Code Pages are mostly the same, but differ by several characters in the 128 - 255 range. In the ISO-8859-1 Code Page, those values are control characters. Microsoft felt it better not to waste 16 (or however many) characters on non-printable control characters when the limit was already 256 characters. So they swapped out the control characters for more usable ones, and hence Code Page 1252. But the Unicode group used the standardized ISO-8859-1 for the first 256 characters.
Why does this matter? Because the character you are testing with is one of the lucky few that is in Code Page 1252 but not in ISO-8859-1, hence it cannot remain as 146 when converted to NVARCHAR, and gets translated to its Unicode value, which is 8217. You can see this behavior by running the following:
SELECT '~' + CHAR(146) + '~', N'~' + NCHAR(146) + N'~';
-- ~’~ ~~
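You can also see the remapping directly. Assuming the "current" database uses a Code Page 1252 Collation, CHAR(146) goes through the Code Page and comes out as U+2019, while NCHAR(146) stays at the raw (control) Code Point:
SELECT UNICODE(CONVERT(NVARCHAR(5), CHAR(146))) AS ViaCodePage1252,  -- 8217
       UNICODE(NCHAR(146)) AS RawCodePoint;                          -- 146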
Everything shown above explains most of the observed behavior, but does not explain why @First and @Second, when specified with COLLATE SQL_Latin1_General_CP850_CI_AS but running in a database having a default Collation associated with Code Page 1252, register as "Not Equal". If using Code Page 850 translates them to ASCII 39, they should still be equal, right?
This is due to both the sequence of events and the fact that Code Pages are not relevant to Unicode data (i.e. anything stored in NCHAR, NVARCHAR, and the deprecated NTEXT type that nobody should be using). Breaking down what is happening:
Start with @First being declared and initialized (i.e. DECLARE @First VARCHAR(1) = '’';). It is a VARCHAR type, hence using a Code Page, and hence using the Code Page associated with the default Collation of the "current" database.
The default Collation of the "current" database is associated with Code Page 1252, hence this value is not translated to ASCII 39, but exists happily as ASCII 146.
Next @Second is declared and initialized (i.e. DECLARE @Second NVARCHAR(1) = @First; -- no need for explicit CONVERT as this is not production code and it will be converted implicitly). This is an NVARCHAR type which, as we have seen, has the character, but converts the value from ASCII 146 to Code Point U+2019 (Decimal 8217 = 0x2019).
In the comparison, using @First COLLATE SQL_Latin1_General_CP850_CI_AS starts with ASCII 146, as @First is VARCHAR data using the Code Page specified by the default Collation of the "current" database. But then, since that character does not exist in Code Page 850 (as specified by the Collation used in the COLLATE clause), it gets translated into ASCII 39 (as we have seen above).
Why didn't @Second COLLATE SQL_Latin1_General_CP850_CI_AS also translate that character to ASCII 39 so that they would register as "Equal"? Because:
@Second is NVARCHAR and does not use Code Pages, as all characters are represented in a single character set (i.e. Unicode). So changing the Collation can only change the rules governing how to compare and sort the characters, but will not alter the characters, such as what happens sometimes when changing the Collation of VARCHAR data (like in this case of ’). Hence this side of the comparison is still Code Point U+2019.
@First, being VARCHAR, will get implicitly converted into NVARCHAR for the comparison. BUT, the ’ character had already been translated into ASCII 39 by the COLLATE SQL_Latin1_General_CP850_CI_AS clause, and ASCII 39 is found in Unicode in that same position, either as Decimal 39 or Code Point U+0027 (from SELECT CONVERT(BINARY(2), 39)).
Resulting comparison is between: Code Point U+2019 and Code Point U+0027
Ergo: Not Equal
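Here is a minimal sketch of that sequence, run in a database whose default Collation is associated with Code Page 1252 (variable names follow the question; the expected values are in the comments):
DECLARE @First VARCHAR(10) = '’';       -- stored as byte 0x92 (decimal 146) under Code Page 1252
DECLARE @Second NVARCHAR(10) = @First;  -- converted to Code Point U+2019 (decimal 8217)

SELECT UNICODE(@Second) AS SecondSide,  -- 8217 (U+2019)
       UNICODE(CONVERT(NVARCHAR(10),
               @First COLLATE SQL_Latin1_General_CP850_CI_AS)) AS FirstSide;  -- 39 (U+0027)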
For more info on working with Collations, Encodings, Unicode, etc, please visit: Collations Info

Determining index of last uppercase letter in column value (SQL)?

Short version: Is there a way to easily extract and ORDER BY a substring of values in a DB column based on the index (position) of the last upper case letter in that value, only using SQL?
Long version: I have a table with a username field, and the convention for usernames is the capitalized first initial of the first name, followed by the capitalized first initial of the last name, followed by the rest of the last name. As a result, ordering by the username field is 'wrong'. Ordering by a substring of the username value would theoretically work, e.g.
SUBSTRING(username,2, LEN(username))
...except that there are values with a capitalized middle initial between the other two initials. I am curious to know if, using only SQL (MS SQL Server), there is a fairly straightforward / simple way to:
Test the case of a character in a DB value (and return a boolean)
Determine the index of the last upper case character in a string value
Assuming this is even remotely possible, I assume one would have to loop through the individual letters of each username to accomplish it, making it terribly inefficient, but if you have a magical shortcut, feel free to share. Note: This question is purely academic as I have since decided to go a much simpler way. I am just curious if this is even possible.
Test the case of a character in a DB value (and return a boolean)
SQL Server does not have a boolean datatype. bit is often used in its place.
DECLARE @Char CHAR(1) = 'f'

SELECT CAST(CASE
              WHEN @Char LIKE '[A-Z]' COLLATE Latin1_General_Bin
              THEN 1
              ELSE 0
            END AS BIT)
/* Returns 0 */
Note it is important to use a binary collation rather than a case sensitive collate clause with the above syntax. If using a CS collate clause the pattern would need to be spelled out in full as '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' to avoid matching lower case characters.
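To illustrate the difference: with a case-sensitive (but non-binary) collation the linguistic sort order interleaves the cases (a A b B ... z Z), so the range [A-Z] still matches most lower-case letters. A quick sketch of what I would expect:
SELECT CASE WHEN 'f' LIKE '[A-Z]' COLLATE Latin1_General_CS_AS THEN 1 ELSE 0 END AS [CS collation],     -- 1
       CASE WHEN 'f' LIKE '[A-Z]' COLLATE Latin1_General_Bin   THEN 1 ELSE 0 END AS [Binary collation]  -- 0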
Determine the index of the last upper case character in a string value
SELECT PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin, REVERSE('Your String'))
/* Returns the one-based index (6) */
SELECT PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin, REVERSE('no capitals'))
/* Returns 0 if the test string doesn't contain any letters in the range A-Z */
To extract the surname according to those rules you can use
SELECT RIGHT(Name,PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin ,REVERSE(Name)))
FROM YourTable
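Putting the pieces together, a sketch that sorts by the extracted surname (YourTable and Name are the placeholder names from above; rows with no upper-case letter at all would get an empty Surname):
SELECT Name,
       RIGHT(Name, PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin, REVERSE(Name))) AS Surname
FROM YourTable
ORDER BY Surname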

Confused about default string comparison option in SQL Server

I am completely confused about the default string comparison method used in Microsoft SQL Server. Up till now I had been using UPPER() and LOWER() functions for performing any string comparison on Microsoft SQL Server.
However, I have since learned that Microsoft SQL Server is case insensitive by default, and that we need to change the collation while installing Microsoft SQL Server to make it case sensitive. If this is the case, then what is the use of the UPPER() and LOWER() functions?
If you would like to compare strings case-sensitively, this might be the syntax you are looking for:
IF @STR1 COLLATE Latin1_General_CS_AS <> @STR2 COLLATE Latin1_General_CS_AS
    PRINT 'NOT MATCH'
As you have discovered, upper and lower are only of use in comparisons when you have a case-sensitive collation applied, but that doesn't make them useless.
For example, Upper and Lower can be used for formatting results.
select upper(LicencePlate) from cars
You can apply collations without reinstalling, by applying them to a column in the table design (see the sketch after the example below), or to specific comparisons, e.g.:
if 'a' = 'A' collate latin1_general_cs_as
    select '1'
else
    select '2'

if 'a' = 'A' collate latin1_general_ci_as
    select '3'
else
    select '4'
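For the column case, something along these lines should work (the cars table and LicencePlate column come from the earlier example; the data type and length here are assumptions):
ALTER TABLE cars
    ALTER COLUMN LicencePlate VARCHAR(20) COLLATE Latin1_General_CS_AS;
-- restate the column's existing data type, length and nullability, changing only the collation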
See http://technet.microsoft.com/en-us/library/aa258272(v=sql.80).aspx

Multi-language support

We have developed a site that needs to display text in English, Polish, Slovak and Czech. However, when the text is entered into the database, any accented letters are changed to English letters.
After searching around on forums, I have found that it is possible to put an 'N' in front of a string which contains accented characters. For example:
INSERT INTO Table_Name (Col1, Col2) VALUES (N'Value1', N'Value2')
However, the site has already been fully developed so at this stage, going through all of the INSERT and UPDATE queries in the site would be a very long and tedious process.
I was wondering if there is any other, much quicker, way of doing what I am trying to do?
The database is MSSQL and the columns being inserted into are already nvarchar(n).
There isn't any quick solution.
The updates and inserts are wrong and need to be fixed.
If they were parameterized queries, you could have simply made sure they were using the NVarChar database type and you would not have a problem.
Since they are dynamic strings, you will need to ensure that you add the Unicode specifier (N) in front of each text literal you are inserting/updating.
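If you do get the chance to parameterize any of them as you go, a rough sketch using sp_executesql with NVARCHAR parameters (Table_Name, Col1 and Col2 are from your question; the sample values are arbitrary Polish text):
DECLARE @Val1 NVARCHAR(50) = N'Gdańsk',
        @Val2 NVARCHAR(50) = N'Kraków';

EXEC sys.sp_executesql
     N'INSERT INTO Table_Name (Col1, Col2) VALUES (@p1, @p2);',
     N'@p1 NVARCHAR(50), @p2 NVARCHAR(50)',
     @p1 = @Val1,
     @p2 = @Val2;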
The topic starter wrote:
"text in English, Polish, Slovak and Czech. However, when the text is entered into the database, any accented letters are changed to English letters." After searching around on forums, I have found that it is possible to put an 'N' in front of a string which contains accented characters. For example:
INSERT INTO Table_Name (Col1, Col2) VALUES (N'Value1', N'Value2')
"The collation for the database as a whole is Latin1_General_CI_AS"
I do not see how it could happen due to SQL Server, since Latin1_General_CI_AS handles European "non-English" letters:
--on database with collation Latin1_General_CI_AS
declare @test_multilanguage_eu table
(
    c1 char(12),
    c2 nchar(12)
)
INSERT INTO @test_multilanguage_eu VALUES ('éÉâÂàÀëËçæà', 'éÉâÂàÀëËçæà')
SELECT c1, cast(c1 as binary(4)) as c1bin, c2, cast(c2 as binary(4)) as c2bin
FROM @test_multilanguage_eu
outputs:
c1 c1bin c2 c2bin
------------ ---------- ------------ ----------
éÉâÂàÀëËçæà 0xE9C9E2C2 éÉâÂàÀëËçæà 0xE900C900
(1 row(s) affected)
I believe you simply have to check the checkboxes in Control Panel --> Regional and Language Options --> Advanced tab --> Code page conversion tables, and check that you render in the same code page as you store it.
Converting to Unicode from the encodings used by clients would lead to problems when rendering back to web clients, it seems to me.
I believe that most European collation designators use codepage 1252 [1], [2].
Update:
SELECT COLLATIONPROPERTY('Latin1_General_CI_AS', 'CodePage')
outputs 1252
[1] http://msdn.microsoft.com/en-us/library/ms174596.aspx
[2] Windows 1252: http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

Unicode characters causing issues in SQL Server 2005 string comparison

This query:
select *
from op.tag
where tag = 'fussball'
Returns a result which has a tag column value of "fußball". Column "tag" is defined as nvarchar(150).
While I understand they are similar words grammatically, can anyone explain and defend this behavior? I assume it is related to the same collation settings which allow you to change case sensitivity on a column/table, but who would want this behavior? A unique constraint on the column also causes failure on inserts of one value when the other exists due to a constraint violation. How do I turn this off?
Follow-up bonus point question. Explain why this query does not return any rows:
select 1
where 'fußball' = 'fussball'
Bonus question (answer?): @ScottCher pointed out to me privately that this is due to the string literal "fussball" being treated as a varchar. This query DOES return a result:
select 1
where 'fußball' = cast('fussball' as nvarchar)
But then again, this one does not:
select 1
where cast('fußball' as varchar) = cast('fussball' as varchar)
I'm confused.
I guess the Unicode collation set for your connection/table/database specifies that ss == ß. The latter behavior would be because it's on a faulty fast path, or maybe it does a binary comparison, or maybe you're not passing in the ß in the right encoding (I agree it's stupid).
http://unicode.org/reports/tr10/#Searching mentions that U+00DF is special-cased. Here's an insightful excerpt:
Language-sensitive searching and matching are closely related to collation. Strings that compare as equal at some strength level are those that should be matched when doing language-sensitive matching. For example, at a primary strength, "ß" would match against "ss" according to the UCA, and "aa" would match "å" in a Danish tailoring of the UCA.
The SELECT does return a row with collation Latin1_General_CI_AS (SQL2000).
It does not with collation Latin1_General_BIN.
You can assign a table column a collation by using the COLLATE < collation > keyword after N/VARCHAR.
You can also compare strings with a specific collation using the syntax
string1 = string2 COLLATE < collation >
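For example (a sketch; I would expect the linguistic comparison to match under the UCA rules quoted above and the binary one not to):
SELECT CASE WHEN N'fußball' = N'fussball' COLLATE Latin1_General_CI_AS
            THEN 'row' ELSE 'no row' END AS [Linguistic],  -- 'row'
       CASE WHEN N'fußball' = N'fussball' COLLATE Latin1_General_BIN
            THEN 'row' ELSE 'no row' END AS [Binary]       -- 'no row'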
This isn't an answer that explains behavior, but may be relevant:
In this question, I learned that using the collation Latin1_General_Bin will avoid most collation quirks.
Some helper answers - not the complete one to your question, but still maybe helpful:
If you try:
SELECT 1 WHERE N'fußball' = N'fussball'
you'll get "1" - when using the "N" to signify Unicode, the two strings are considered the same - why that's the case, I don't know (yet).
To find the default collation for a server, use
SELECT SERVERPROPERTY('Collation')
To find the collation of a given column in a database, use this query:
SELECT
name 'Column Name',
OBJECT_NAME(object_id) 'Table Name',
collation_name
FROM sys.columns
WHERE object_ID = object_ID('your-table-name')
AND name = 'your-column-name'
Bonus question (answer?): @ScottCher pointed out to me privately that this is due to the string literal "fussball" being treated as a varchar. This query DOES return a result:
select 1 where 'fußball' = cast('fussball' as nvarchar)
Here you're dealing with the SQL Server data type precedence rules, as stated in Data Type Precedence. Comparisons are always done using the higher-precedence type:
When an operator combines two expressions of different data types, the rules for data type precedence specify that the data type with the lower precedence is converted to the data type with the higher precedence.
Since nvarchar has a higher precedence than varchar, the comparison in your example will occur using the nvarchar type, so it's really exactly the same as select 1 where N'fußball' = N'fussball' (i.e. using Unicode types). I hope this also makes it clear why your last case doesn't return any row.
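A quick, hedged way to see the promotion without running a comparison at all: concatenating a varchar literal with an empty nvarchar literal should yield an nvarchar result under the same precedence rules:
SELECT SQL_VARIANT_PROPERTY('fußball' + N'', 'BaseType') AS ResultType;  -- nvarchar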
