Collation and datatype incompatibility on strings - sql-server

I am very confused by how the system behaves when both collation and datatype differences are involved.
As a minimal example, I am inputting the same Unicode value to the single column of two different tables. In one table the column is varchar and of a certain collation, and on the other it's nvarchar and of another collation. Code and results:
create table cn(code nvarchar(max) collate Latin1_General_CI_AS)
create table cv(code varchar(max) collate SQL_Latin1_General_CP1253_CI_AI)
insert cn select N'3VT18021δ'
insert cv select N'3VT18021δ'
select * from cn
select * from cv
--1.
select * from cn inner join cv on cn.code=cv.code
-- Cannot resolve the collation conflict between "SQL_Latin1_General_CP1253_CI_AI" and "Latin1_General_CI_AS" in the equal to operation.
--2.
select * from cn inner join cv on cn.code=cv.code collate SQL_Latin1_General_CP1253_CI_AI
-- returns one row
--3.
select * from cn inner join cv on cn.code =cv.code collate Latin1_General_CI_AS
-- returns 0 rows
--4.
select * from cn inner join cv on cn.code collate SQL_Latin1_General_CP1253_CI_AI =cv.code
-- returns one row
--5.
select * from cn inner join cv on cn.code collate Latin1_General_CI_AS =cv.code
-- returns one row
My notes:
Case 1: collation difference, I understand
Cases 2 and 5: return (correctly) one row. Why does collating a field
to its own collation do any good?
Cases 3 and 4: Why converting one's collation to the other works one
time, but not the other?
Of course, all of this is further complicated by the datatype difference.

Cases 2 and 5: return (correctly) one row. Why does collating a field to its own collation do any good?
When you explicitly use COLLATE on one side of a comparison, that explicit collation takes precedence over the column's own (implicit) collation, so both sides of the expression are compared using it and there is no conflict.
Cases 3 and 4: Why converting one's collation to the other works one time, but not the other?
One of your columns is a varchar, so when it's changed from one collation to the other, its value can change. This happens, specifically, when you COLLATE the value in your table cv to the collation Latin1_General_CI_AS. As 'δ' isn't a character available in the code page that collation uses for a varchar, it changes to a 'd', and '3VT18021d' does not equal N'3VT18021δ'. You can see this with the below:
SELECT code COLLATE Latin1_General_CI_AS
FROM cv;
You would need to explicitly convert the value to a nvarchar first:
select *
from cn
inner join cv on cn.code = CONVERT(nvarchar(MAX),cv.code) collate Latin1_General_CI_AS;
--Returns one row now
Edit: To explain why Query 3 does not return data, and Query 5 does, this is because of the positioning of the COLLATEs and when the implicit conversion happens.
cn.code =cv.code collate Latin1_General_CI_AS --3
cn.code collate Latin1_General_CI_AS =cv.code --5
For Query 3, the COLLATE expression is on cv.code, which is the varchar. As a result the value has its collation changed first and the character 'δ' is lost. Then it is implicitly converted to an nvarchar, due to data type precedence.
For Query 5, however, the COLLATE is on cn.code, the nvarchar. As a result, when the value's collation is changed no characters are lost. As cv.code doesn't have an explicit COLLATE, it is instead first converted to an nvarchar (due to data type precedence) and then collated, causing no loss of characters.

A collation is a part of the datatype. The internal representation of chars may differ if you use a different collation, and many constraints do not behave the same way under different collations (PRIMARY KEY, UNIQUE, CHECK...).
Mixing different collations in operators (=, LIKE, +) and in some functions (CONCAT...) systematically results in an error until you impose a specific collation for the operation.
So there is a COLLATE keyword, acting as an operator, to disambiguate which collation should be used.
SQL Server distinguishes two kinds of collations:
Technical collations, with a name beginning with SQL_
Semantical collations, for functionality purposes, with a name that begins with a language name
Technical collations should be used only to recover imported data that have a specific encoding... As an example, you can have collations that are the strict equivalent of IBM EBCDIC, but it would be a stupid idea to keep these collations for SQL Server table manipulations!
Semantical collations are widely used to support application functionality... Do you want CI or CS (case behaviour), AI or AS (diacritical behaviour), WS (width behaviour, e.g. whether 2 = ²), etc...
Using these queries:
select CAST(code AS VARBINARY(max)) from cn;
select CAST(code AS VARBINARY(max)) from cv;
You will find that the last character does not have the same code in the two tables. That is why no rows are returned when the varchar value is collated to Latin1_General_CI_AS...
In the NVARCHAR(max) column, "δ" is stored as the two bytes "B403" (U+03B4, low byte first). In the VARCHAR(max) column it is stored as a single byte of code page 1253. Latin1_General_CI_AS uses code page 1252 for VARCHAR, which has no "δ" at all, so the character cannot be translated into that code page at 1 byte per char...
In fact the byte that encodes "δ" in code page 1253 (E4) reads as "ä" in code page 1252, not "δ".
In other words, putting 1 byte into 2 bytes is easy, because every single-byte character has a Unicode equivalent. But, conversely, putting 2 bytes into one is possible only if the target code page contains that character...
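To compare the stored bytes side by side, a minimal sketch along these lines could be used. It assumes the cn and cv tables from the question still exist and hold the single row inserted there; the column aliases are made up for illustration.

```sql
-- Show the raw bytes of the last character in each table.
-- cn.code is nvarchar (2 bytes per character here);
-- cv.code is varchar in code page 1253 (1 byte per character).
SELECT 'cn' AS source_table,
       CAST(RIGHT(code, 1) AS varbinary(2)) AS last_char_bytes
FROM cn
UNION ALL
SELECT 'cv' AS source_table,
       CAST(RIGHT(code, 1) AS varbinary(2)) AS last_char_bytes
FROM cv;
```

The two rows should differ in both length and value, which is the mismatch the queries above point at.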

Related

Using CAST, CONCAT and COLLATE with a LEFT OUTER JOIN

I'm trying to CONCAT two columns and also use CAST and COLLATE but keep getting a host of different errors when I try to fix them in a way I think would work.
Basically I am trying to CONCAT two columns together but I get a collation conflict. So then, I try and COLLATE the two columns and I then get a datatype is invalid for COLLATE error. After this I try to CAST the column giving me the error to change it to a varchar but it doesn't work. I'm just unsure how to make all 3 work together.
SELECT TransactionHeader.TransactionType,
TransactionHeader.TicketStub,
CAST ( TransactionHeader.TransactionNumber AS nvarchar(8)) AS [TN],
TransactionHeader.ActualAmount,
Currencies.SwiftCode,
TransactionHeader.CurrencyID,
Divisions.ShortName,
DealHeader.StartDateNumber,
DealHeader.EndDateNumber,
CONCAT (TransactionHeader.TicketStub,
TransactionHeader.TransactionNumber) AS [DealRef]
FROM Company.dbo.TransactionHeader TransactionHeader
LEFT OUTER JOIN Company.dbo.DealHeader DealHeader
ON TransactionHeader.THDealID=DealHeader.DHDealID
LEFT OUTER JOIN Company.dbo.Currencies Currencies
ON TransactionHeader.CurrencyID=Currencies.CRRecordID
LEFT OUTER JOIN Company.dbo.Divisions Divisions
ON TransactionHeader.PrimaryPartyID=Divisions.DVRecordID
WHERE TransactionHeader.TicketStub COLLATE DATABASE_DEFAULT
= TransactionHeader.TransactionNumber COLLATE DATABASE_DEFAULT
All in all, I just want to CONCAT the TicketStub and TransactionNumber columns but I am not sure how to get past the errors I'm getting. As far as the COLLATE goes, I'm still kind of unsure how it even works; I just know that to fix the collation error I need to use it. I am very new to T-SQL and have only been writing it for the past month and a half so please, any advice at all would be very helpful. Thank you!
Collation is a setting that determines how a DB should treat character data at either the server, database, or column level. There's a really good blog on this at red-gate. Each server, and database, will have a collation. It's common for the databases and server to match, since by default a database will inherit this setting from the model database. It is uncommon to see column-level collation, but that seems to be what you have here, since all of your tables are coming from the same database.
You will need to figure out what the collation is on those columns. Pinal Dave has a good write-up on this on his blog. You can also do this a few other ways. See the docs for that.
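One way to look the collations up is through the catalog views and metadata functions. The sketch below assumes it is run inside the Company database and that the table lives in the dbo schema, as the query in the question suggests; the column aliases are only illustrative.

```sql
-- Server-level and database-level collation.
SELECT SERVERPROPERTY('Collation')                AS server_collation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;

-- Column-level collation for every character column of one table.
-- collation_name is NULL for non-character columns, so those are skipped.
SELECT c.name AS column_name,
       c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.TransactionHeader')
  AND c.collation_name IS NOT NULL;
```

Any column whose collation_name differs from the database collation is a candidate for the conflict.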
Once you have your collation, you can then collate the arguments of the CONCAT. It will look something like the below. Here I just use DATABASE_DEFAULT, which would probably work in your case:
CONCAT(TransactionHeader.TicketStub COLLATE DATABASE_DEFAULT,TransactionHeader.TransactionNumber COLLATE DATABASE_DEFAULT) AS [DealRef]
You can find more examples of COLLATE WITH CONCAT in this answer and this one
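If TransactionNumber is a numeric column (an assumption, suggested by the CAST in the question and by the "datatype is invalid for COLLATE" error), COLLATE cannot be applied to it directly. A possible sketch is to CAST it to a character type first and only then apply COLLATE; the alias th is made up for brevity.

```sql
SELECT th.TicketStub,
       CAST(th.TransactionNumber AS nvarchar(8)) AS [TN],
       CONCAT(
           th.TicketStub COLLATE DATABASE_DEFAULT,
           -- COLLATE is only valid on character data, so convert first.
           CAST(th.TransactionNumber AS nvarchar(8)) COLLATE DATABASE_DEFAULT
       ) AS [DealRef]
FROM Company.dbo.TransactionHeader AS th;
```

The length of 8 is taken from the question's own CAST; widen it if the numbers can be longer.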

unexpected output sql server using count

I am using sql-server 2012
The query is :
CREATE TABLE TEST ( NAME VARCHAR(20) );
INSERT TEST
( NAME
)
SELECT NULL
UNION ALL
SELECT 'James'
UNION ALL
SELECT 'JAMES'
UNION ALL
SELECT 'Eric';
SELECT NAME
, COUNT(NAME) AS T1
, COUNT(COALESCE(NULL, '')) T2
, COUNT(ISNULL(NAME, NULL)) T3
, COUNT(DISTINCT ( Name )) T4
, COUNT(DISTINCT ( COALESCE(NULL, '') )) T5
, @@ROWCOUNT T6
FROM TEST
GROUP BY Name;
DROP TABLE TEST;
In the result set there is no 'JAMES' (caps).
Please tell me how this was excluded.
Expected was NULL, James, JAMES, Eric.
You need to change your Name column collation to Latin1_General_CS_AS which is case sensitive
SELECT NAME COLLATE Latin1_General_CS_AS,
Count(NAME) AS T1,
Count(COALESCE(NULL, '')) T2,
Count(Isnull(NAME, NULL)) T3,
Count(DISTINCT ( Name )) T4,
Count(DISTINCT ( COALESCE(NULL, '') )) T5,
@@ROWCOUNT T6
FROM TEST
GROUP BY Name COLLATE Latin1_General_CS_AS;
Use a case-sensitive collation like COLLATE Latin1_General_CS_AS.
CREATE TABLE TEST ( NAME VARCHAR(20) COLLATE Latin1_General_CS_AS );
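If the table already exists, the column's collation can also be changed in place rather than recreating the table. This is only a sketch: it assumes the TEST table from the question and that no index or constraint depends on the column, since such dependencies would have to be dropped first.

```sql
-- Change the collation of an existing column to a case-sensitive one.
-- The data type and nullability are restated because ALTER COLUMN
-- redefines the whole column.
ALTER TABLE TEST
    ALTER COLUMN NAME VARCHAR(20) COLLATE Latin1_General_CS_AS NULL;
```

After this, a plain GROUP BY Name treats 'James' and 'JAMES' as different values.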
The other people who commented here are correct.
It would be easier for you to understand their meaning if you googled for collation and case sensitivity, but in layman's terms it's like this:
Collation is a little like encoding; It determines how the characters in string columns are interpreted, ordered and compared to one another. Case insensitive means that UPPERCASE / lowercase are considered exactly the same, so for instance 'JAMES', 'james', 'JaMeS' etc would be no different to SQL Server. So when your database has a case-insensitive collation and you then create a table with a column without defining the collation, that column will inherit the default collation used by the database, which is how we arrived here.
You can manually alter a column collation, or define it during a query, but bear in mind that whenever you compare two different columns, you need to assign both of them to use the same collation, or you will get an error. That's why it's good practice to pretty much use the same collation throughout the database barring special query-specific circumstances.
To your question regarding what Latin1_General_CS_AS means, it basically means "Latin1_General" alphabet, the details of which you can check online. The "CS" part means case-sensitive, if it were case-insensitive you would see "CI" instead. The "AS" means accent-sensitivity, and "AI" would mean accent-insensitivity. Basically, whether 'Á' is considered to be equal to 'A', or not.
You can read a lot more about it from the source, here.
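The suffixes can be tried out directly on literals. The sketch below forces each comparison to one collation with an explicit COLLATE; the column aliases are made up for illustration.

```sql
SELECT
    -- CI: case is ignored, so the strings match.
    CASE WHEN N'JAMES' = N'james' COLLATE Latin1_General_CI_AS THEN 1 ELSE 0 END AS ci_equal,  -- 1
    -- CS: case matters, so they do not match.
    CASE WHEN N'JAMES' = N'james' COLLATE Latin1_General_CS_AS THEN 1 ELSE 0 END AS cs_equal,  -- 0
    -- AI: the accent is ignored.
    CASE WHEN N'Á' = N'A' COLLATE Latin1_General_CI_AI THEN 1 ELSE 0 END AS ai_equal,          -- 1
    -- AS: the accent matters.
    CASE WHEN N'Á' = N'A' COLLATE Latin1_General_CI_AS THEN 1 ELSE 0 END AS as_equal;          -- 0
```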

ORDER BY ... COLLATE in SQL Server

Under SQL Server. A table contains some text with different cases. I want to sort them case-sensitive and thought that a COLLATE in the ORDER BY would do it. It doesn't. Why?
CREATE TABLE T1 (C1 VARCHAR(20))
INSERT INTO T1 (C1) VALUES ('aaa1'), ('AAB2'), ('aba3')
SELECT * FROM T1 ORDER BY C1 COLLATE Latin1_General_CS_AS
SELECT * FROM T1 ORDER BY C1 COLLATE Latin1_General_CI_AS
Both queries return the same, even if the first one is "CS" for case-sensitive
aaa1
AAB2
aba3
(in the first case, I want AAB2, aaa1, aba3)
My server is a SQL Server Express 2008 (10.0.5500) and its default server collation is Latin1_General_CI_AS.
The collation of the database is Latin1_General_CI_AS too.
The result remains the same if I use SQL_Latin1_General_CP1_CS_AS in place of Latin1_General_CS_AS.
You need a binary collation for your desired sort order with A-Z sorted before a-z.
SELECT * FROM T1 ORDER BY C1 COLLATE Latin1_General_bin
The CS collation sorts aAbB ... zZ.
That is the correct case-sensitive collation sort order. Why this is the case is explained in Case Sensitive Collation Sort Order; it has to do with the Unicode specifications for sorting. aa will sort ahead of AA, but AA will sort ahead of ab.
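The two orderings can be compared without creating a table. This sketch reuses the three values from the question in a table value constructor (available from SQL Server 2008); the alias v is arbitrary.

```sql
-- Case-sensitive linguistic collation: case only breaks ties,
-- so the order is aaa1, AAB2, aba3 (as reported in the question).
SELECT v.C1
FROM (VALUES ('aaa1'), ('AAB2'), ('aba3')) AS v(C1)
ORDER BY v.C1 COLLATE Latin1_General_CS_AS;

-- Binary collation: characters compare by code value, and 'A' (0x41)
-- comes before 'a' (0x61), so the order is AAB2, aaa1, aba3.
SELECT v.C1
FROM (VALUES ('aaa1'), ('AAB2'), ('aba3')) AS v(C1)
ORDER BY v.C1 COLLATE Latin1_General_BIN;
```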

How to set collation for a connection in SQL Server?

How can i set the collation SQL Server will use for the duration of that connection?
Not until i connect to SQL Server do i know what collation i want to use.
e.g. a browser with language fr-IT has connected to the web-site. Any queries i run on that connection i want to follow the French language, Italy variant collation.
i envision a hypothetical connection level property, similar to SET ANSI_NULLS OFF, but for collation [1]:
SET COLLATION_ORDER 'French_CI_AS'
SELECT TOP 100 * FROM Orders
ORDER BY ProjectName
and later
SELECT * FROM Orders
WHERE CustomerID = 3277
AND ProjectName LIKE '%l''ecole%'
and later
UPDATE Quotes
SET IsCompleted = 1
WHERE QuoteName = 'Cour de l''école'
At the same time, when a chinese customer connects:
SET COLLATION_ORDER Chinese_PRC_CI_AI_KS_WS
SELECT TOP 100 * FROM Orders
ORDER BY ProjectName
or
SELECT * FROM Orders
WHERE CustomerID = 3277
AND ProjectName LIKE '學校'
or
UPDATE Quotes
SET IsCompleted = 1
WHERE QuoteName = '學校的操場'
Now i could alter every SELECT statement in the system to allow me to pass in a collation:
SELECT TOP 100 * FROM Orders
WHERE CustomerID = 3278
ORDER BY ProjectName COLLATE French_CI_AS
But you cannot pass a collation order as a parameter to a stored procedure:
CREATE PROCEDURE dbo.GetCommonOrders
@CustomerID int, @CollationOrder varchar(50)
AS
SELECT TOP 100 * FROM Orders
WHERE CustomerID = @CustomerID
ORDER BY ProjectName COLLATE @CollationOrder
And the COLLATE clause can't help me when performing an UPDATE or a SELECT.
Note: All string columns in the database are already nchar, nvarchar or ntext. i am not talking about the default collation applied to a server, database, table, or column for non-unicode columns (i.e. char, varchar, text). i am talking about the collation used by SQL Server when comparing and sorting strings.
How can i specify per-connection collation?
See also
Similar question, but for ADO.net and connection strings
Similar question, but for ASP.net MVC2 and MySQL
[1] hypothetical SQL that exhibits locale issues
As marc_s commented, the collation is a property of a database or a column, and not of a connection.
However, you can override the collation on statement level using the COLLATE keyword.
Using your examples:
SELECT * FROM Orders
WHERE CustomerID = 3277
AND ProjectName COLLATE Chinese_PRC_CI_AI_KS_WS LIKE N'學校'
UPDATE Quotes
SET IsCompleted = 1
WHERE QuoteName COLLATE Chinese_PRC_CI_AI_KS_WS = N'學校的操場'
Still, I cannot find a statement on using COLLATE with a dynamic collation name, which leaves dynamic SQL and EXEC as the only possible solution. See this social.MSDN entry for an example.
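A minimal sketch of that dynamic SQL approach follows. The variable names are made up, and the collation name is assumed to arrive from the application. Because a collation name cannot be passed as a parameter, it is first checked against sys.fn_helpcollations() and only the name returned by that function is concatenated into the statement; the customer id still travels as a real parameter through sp_executesql.

```sql
DECLARE @RequestedCollation sysname = N'French_CI_AS';  -- assumed input
DECLARE @CustomerID int = 3277;
DECLARE @ValidatedCollation sysname;
DECLARE @sql nvarchar(max);

-- Accept only names that SQL Server itself lists as collations.
SELECT @ValidatedCollation = name
FROM sys.fn_helpcollations()
WHERE name = @RequestedCollation;

IF @ValidatedCollation IS NULL
BEGIN
    RAISERROR(N'Unknown collation name.', 16, 1);
    RETURN;
END;

-- The collation is spliced into the text; the id stays a parameter.
SET @sql = N'SELECT TOP 100 * FROM Orders '
         + N'WHERE CustomerID = @CustomerID '
         + N'ORDER BY ProjectName COLLATE ' + @ValidatedCollation + N';';

EXEC sys.sp_executesql @sql, N'@CustomerID int', @CustomerID = @CustomerID;
```

The same pattern could be wrapped in a stored procedure that takes the collation name as an ordinary string parameter.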

SQL change field Collation in a select

i'm trying to do the following select:
select * from urlpath where substring(urlpathpath, 3, len(urlpathpath))
not in (select accessuserpassword from accessuser where accessuserparentid = 257)
I get the error:
Cannot resolve the collation conflict between
"SQL_Latin1_General_CP1_CI_AI" and
"SQL_Latin1_General_CP1_CI_AS" in the equal to operation.
Does anyone know how i can cast as a collation, or something that permits me to match this condition?
Thanx
You can add COLLATE CollationName after the column name for the column you want to "re-collate". (Note: the collation name is literal, not quoted)
You can even do the collate on the query to create a new table with that query, for example:
SELECT
*
INTO
#TempTable
FROM
View_total
WHERE
YEAR(ValidFrom) <= 2007
AND YEAR(ValidTo)>= 2007
AND Id_Product = '001'
AND ProductLine COLLATE DATABASE_DEFAULT IN (SELECT Product FROM #TempAUX)
COLLATE DATABASE_DEFAULT makes the COLLATE clause use the collation of the current database, eliminating the difference between the two.
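Applied to the query from the question, this might look like the sketch below. It puts COLLATE DATABASE_DEFAULT on both sides so the comparison no longer depends on either column's own collation; the table and column names are the ones given in the question.

```sql
SELECT *
FROM urlpath
WHERE SUBSTRING(urlpathpath, 3, LEN(urlpathpath)) COLLATE DATABASE_DEFAULT
      NOT IN (SELECT accessuserpassword COLLATE DATABASE_DEFAULT
              FROM accessuser
              WHERE accessuserparentid = 257);
```

Collating only one side to the other column's collation would also resolve the conflict; which one to pick depends on whether the match should be accent sensitive (AS) or not (AI).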

Resources