ORDER BY ... COLLATE in SQL Server

I'm using SQL Server. A table contains some text in different cases. I want to sort the rows case-sensitively and thought that a COLLATE in the ORDER BY would do it. It doesn't. Why?
CREATE TABLE T1 (C1 VARCHAR(20))
INSERT INTO T1 (C1) VALUES ('aaa1'), ('AAB2'), ('aba3')
SELECT * FROM T1 ORDER BY C1 COLLATE Latin1_General_CS_AS
SELECT * FROM T1 ORDER BY C1 COLLATE Latin1_General_CI_AS
Both queries return the same result, even though the first one uses "CS" (case-sensitive):
aaa1
AAB2
aba3
(in the first case, I want AAB2, aaa1, aba3)
My server is a SQL Server Express 2008 (10.0.5500) and its default server collation is Latin1_General_CI_AS.
The collation of the database is Latin1_General_CI_AS too.
The result remains the same if I use SQL_Latin1_General_CP1_CS_AS in place of Latin1_General_CS_AS.

You need a binary collation for your desired sort order with A-Z sorted before a-z.
SELECT * FROM T1 ORDER BY C1 COLLATE Latin1_General_bin
The CS collation sorts aAbB ... zZ
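As a rough illustration of the difference, the following sketch sorts the question's three values under both collations. The table variable @T is only a stand-in for T1 and is not part of the original question.

```sql
-- Sketch only: @T is a hypothetical stand-in for the question's table T1.
DECLARE @T TABLE (C1 VARCHAR(20));
INSERT INTO @T (C1) VALUES ('aaa1'), ('AAB2'), ('aba3');

-- Case-sensitive dictionary order: letters are compared first and case only
-- breaks ties, so 'aaa1' still sorts before 'AAB2'.
SELECT C1 FROM @T ORDER BY C1 COLLATE Latin1_General_CS_AS;

-- Binary order: character codes are compared, and 'A' (65) is lower than
-- 'a' (97), so 'AAB2' sorts first.
SELECT C1 FROM @T ORDER BY C1 COLLATE Latin1_General_BIN;
```

The first query should return aaa1, AAB2, aba3 and the second AAB2, aaa1, aba3, which is the order the question asks for.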

Because that is the correct sort order for a case-sensitive collation. Case Sensitive Collation Sort Order explains why: it follows the Unicode specifications for sorting, in which case is only a tie-breaker between otherwise identical strings. aa will sort ahead of AA, but AA will sort ahead of ab.
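A small sketch of that behaviour, assuming SQL Server 2008 or later for the VALUES row constructor (the alias t(v) is just illustrative): the collation does treat 'aa' and 'AA' as different values, yet it orders them next to each other.

```sql
-- Under a case-sensitive collation the two strings are not equal...
SELECT CASE WHEN 'aa' = 'AA' COLLATE Latin1_General_CS_AS
            THEN 'equal' ELSE 'different' END AS cs_comparison;

-- ...but when sorting, case only decides between strings that are
-- otherwise identical, so both 'aa' and 'AA' come before 'ab'.
SELECT v
FROM (VALUES ('ab'), ('AA'), ('aa')) AS t(v)
ORDER BY v COLLATE Latin1_General_CS_AS;
```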

Related

Collation and datatype incompatibility on strings

I am very confused by the behavior of the system when both collation and datatype differences are involved.
As a minimal example, I am inputting the same Unicode value to the single column of two different tables. In one table the column is varchar and of a certain collation, and on the other it's nvarchar and of another collation. Code and results:
create table cn(code nvarchar(max) collate Latin1_General_CI_AS)
create table cv(code varchar(max) collate SQL_Latin1_General_CP1253_CI_AI)
insert cn select N'3VT18021δ'
insert cv select N'3VT18021δ'
select * from cn
select * from cv
--1.
select * from cn inner join cv on cn.code=cv.code
-- Cannot resolve the collation conflict between "SQL_Latin1_General_CP1253_CI_AI" and "Latin1_General_CI_AS" in the equal to operation.
--2.
select * from cn inner join cv on cn.code=cv.code collate SQL_Latin1_General_CP1253_CI_AI
-- returns one row
--3.
select * from cn inner join cv on cn.code =cv.code collate Latin1_General_CI_AS
-- returns 0 rows
--4.
select * from cn inner join cv on cn.code collate SQL_Latin1_General_CP1253_CI_AI =cv.code
-- returns one row
--5.
select * from cn inner join cv on cn.code collate Latin1_General_CI_AS =cv.code
-- returns one row
My notes:
Case 1: collation difference, I understand
Cases 2 and 5: return (correctly) one row. Why does collating a field
to its own collation do any good?
Cases 3 and 4: Why does converting one's collation to the other's work one
time, but not the other?
Of course, all of this is further complicated by the datatype difference.
Cases 2 and 5: return (correctly) one row. Why does collating a field to its own collation do any good?
When you explicitly use COLLATE on one side of a comparison, that explicit collation takes precedence over the implicit collation of the column on the other side, so both sides are compared under the collation you named and there is no conflict.
Cases 3 and 4: Why does converting one's collation to the other's work one time, but not the other?
One of your columns is a varchar, so when it's changed from one collation to another that uses a different code page, its value can change. This is, specifically, what happens when you COLLATE the value in your table cv to the collation Latin1_General_CI_AS. As 'δ' isn't a character available in the varchar code page of that collation (1252), it changes to a 'd', and '3VT18021d' does not equal N'3VT18021δ'. You can see this with the below:
SELECT code COLLATE Latin1_General_CI_AS
FROM cv;
You would need to explicitly convert the value to a nvarchar first:
select *
from cn
inner join cv on cn.code = CONVERT(nvarchar(MAX),cv.code) collate Latin1_General_CI_AS;
--Returns one row now
Edit: Query 3 does not return data while Query 5 does because of the positioning of the COLLATEs and when the implicit conversion happens.
cn.code =cv.code collate Latin1_General_CI_AS --3
cn.code collate Latin1_General_CI_AS =cv.code --5
For Query 3, the COLLATE expression is on cv.code, which is the varchar. As a result the value has its collation changed first and the character 'δ' is lost. Then it is implicitly converted to an nvarchar, due to data type precedence.
For Query 5, however, the COLLATE is on cn.code, the nvarchar. As a result, when the value's collation is changed no characters are lost. As cv.code doesn't have an explicit COLLATE, it is instead first converted to an nvarchar (due to data type precedence) and then collated, causing no loss of characters.
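The two orders of operation can be spelled out explicitly against the question's cv table. This is only a sketch of the explanation above, with the implicit conversions written as explicit CONVERT calls; the column aliases are made up for illustration.

```sql
-- Roughly what Query 3 does: the collation is changed while the value is
-- still a varchar (so the character is lost), then it is widened.
SELECT CONVERT(nvarchar(max), code COLLATE Latin1_General_CI_AS) AS collate_then_widen
FROM cv;

-- Roughly what Query 5 does: the value is widened to nvarchar first, then
-- the collation is applied, so no character has to be replaced.
SELECT CONVERT(nvarchar(max), code) COLLATE Latin1_General_CI_AS AS widen_then_collate
FROM cv;
```

If the explanation holds, only the first result should end in 'd' rather than 'δ'.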
A collation is part of the data type. The internal representation of characters may differ between collations, and many constraints (PRIMARY KEY, UNIQUE, CHECK...) do not behave the same way under different collations.
Mixing different collations in operators (=, LIKE, +) and in some functions (CONCAT...) systematically results in an error until you impose a specific collation for the operation.
So there is a COLLATE keyword that acts as an operator to specify which collation should be used.
SQL Server distinguishes two kinds of collations:
Technical collations, with a name beginning with SQL_ (the documentation calls these SQL Server collations)
Semantic collations for functional purposes, with a name that begins with a language name (the documentation calls these Windows collations)
Technical collations should be used only to recover imported data that has a specific encoding... As an example, you can have collations that are the strict equivalent of IBM EBCDIC, but it would be a bad idea to keep such collations for everyday manipulation of SQL Server tables!
Semantic collations are widely used to support application functionality... Do you want CI or CS (case behaviour), AI or AS (accent behaviour), WS (width behaviour, like 2 = ²), etc.
Using these queries:
select CAST(code AS VARBINARY(max)) from cn;
select CAST(code AS VARBINARY(max)) from cv;
You will find that the last character does not have the same code in the two tables. That is why no rows are returned when using the Latin1_General_CI_AS collation...
You will see that the "B403" character of the NVARCHAR(max) data type, which is encoded on 2 bytes, is stored in the VARCHAR column as a single byte of code page 1253, at 1 byte per character...
In fact, that single byte (E4) is "δ" only in code page 1253; in code page 1252, which Latin1_General_CI_AS uses for VARCHAR, the same byte is "ä", and there is no "δ" at all.
In other words, widening a 1-byte character to 2 bytes always works. But, conversely, narrowing a 2-byte character to 1 byte works only if the target code page contains that character...
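To inspect the bytes without the tables, something like the following sketch can be used. It relies on the assumption that converting an nvarchar value to varchar uses the code page of the value's collation; the column aliases are illustrative.

```sql
-- UTF-16 little-endian bytes of the nvarchar character (U+03B4).
SELECT CAST(N'δ' AS VARBINARY(2)) AS nvarchar_bytes;  -- 0xB403

-- The same character narrowed to varchar under each collation's code page:
-- code page 1253 (Greek) contains the character, while code page 1252 does
-- not, so a substitute character is stored instead.
SELECT CAST(CAST(N'δ' COLLATE SQL_Latin1_General_CP1253_CI_AI AS VARCHAR(1)) AS VARBINARY(1)) AS cp1253_byte,
       CAST(CAST(N'δ' COLLATE Latin1_General_CI_AS AS VARCHAR(1)) AS VARBINARY(1)) AS cp1252_byte;
```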

unexpected output sql server using count

I am using sql-server 2012
The query is :
CREATE TABLE TEST ( NAME VARCHAR(20) );
INSERT TEST
( NAME
)
SELECT NULL
UNION ALL
SELECT 'James'
UNION ALL
SELECT 'JAMES'
UNION ALL
SELECT 'Eric';
SELECT NAME
, COUNT(NAME) AS T1
, COUNT(COALESCE(NULL, '')) T2
, COUNT(ISNULL(NAME, NULL)) T3
, COUNT(DISTINCT ( Name )) T4
, COUNT(DISTINCT ( COALESCE(NULL, '') )) T5
, @@ROWCOUNT T6
FROM TEST
GROUP BY Name;
DROP TABLE TEST;
In the result set there is no 'JAMES' (caps).
Please explain how it was excluded.
I expected NULL, James, JAMES, Eric.
You need to change your Name column collation to Latin1_General_CS_AS which is case sensitive
SELECT NAME COLLATE Latin1_General_CS_AS,
Count(NAME) AS T1,
Count(COALESCE(NULL, '')) T2,
Count(Isnull(NAME, NULL)) T3,
Count(DISTINCT ( Name )) T4,
Count(DISTINCT ( COALESCE(NULL, '') )) T5,
@@ROWCOUNT T6
FROM TEST
GROUP BY Name COLLATE Latin1_General_CS_AS;
Use a case-sensitive collation like COLLATE Latin1_General_CS_AS.
CREATE TABLE TEST ( NAME VARCHAR(20) COLLATE Latin1_General_CS_AS );
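To see the effect of the collation on grouping in isolation, here is a minimal sketch using the question's four values. The table variable @Test and the alias Cnt are illustrative stand-ins.

```sql
-- Sketch only: @Test is a hypothetical stand-in for the TEST table.
DECLARE @Test TABLE (Name VARCHAR(20));
INSERT INTO @Test (Name) VALUES (NULL), ('James'), ('JAMES'), ('Eric');

-- Case-insensitive grouping: 'James' and 'JAMES' fall into one group,
-- giving three groups in total (including the NULL group).
SELECT Name COLLATE Latin1_General_CI_AS AS Name, COUNT(*) AS Cnt
FROM @Test
GROUP BY Name COLLATE Latin1_General_CI_AS;

-- Case-sensitive grouping: 'James' and 'JAMES' stay separate,
-- giving four groups.
SELECT Name COLLATE Latin1_General_CS_AS AS Name, COUNT(*) AS Cnt
FROM @Test
GROUP BY Name COLLATE Latin1_General_CS_AS;
```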
The other people who commented here are correct.
It would be easier for you to understand their meaning if you googled for collation and case sensitivity, but in layman's terms it's like this:
Collation is a little like encoding; it determines how the characters in string columns are interpreted, ordered and compared to one another. Case-insensitive means that UPPERCASE and lowercase are considered exactly the same, so for instance 'JAMES', 'james', 'JaMeS' etc. would be no different to SQL Server. So when your database has a case-insensitive collation and you then create a table with a column without defining the collation, that column will inherit the default collation used by the database, which is how we arrived here.
You can manually alter a column collation, or define it during a query, but bear in mind that whenever you compare two different columns, you need to assign both of them to use the same collation, or you will get an error. That's why it's good practice to pretty much use the same collation throughout the database barring special query-specific circumstances.
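For the question's table, altering the column could look roughly like the sketch below. It assumes the column should stay VARCHAR(20) and nullable, and that no index or constraint references it (SQL Server refuses to change the collation of a column that is referenced that way).

```sql
-- Change the collation of an existing column. The data type and
-- nullability are restated because ALTER COLUMN redefines the column.
ALTER TABLE TEST
    ALTER COLUMN NAME VARCHAR(20) COLLATE Latin1_General_CS_AS NULL;
```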
To your question regarding what Latin1_General_CS_AS means, it basically means "Latin1_General" alphabet, the details of which you can check online. The "CS" part means case-sensitive, if it were case-insensitive you would see "CI" instead. The "AS" means accent-sensitivity, and "AI" would mean accent-insensitivity. Basically, whether 'Á' is considered to be equal to 'A', or not.
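A quick way to try these suffixes out is to compare literals under different collations, as in this sketch (the column aliases are illustrative). The second query lists the matching collations together with the descriptions SQL Server provides for them.

```sql
-- Compare the same pairs of characters under different sensitivity settings.
SELECT
    CASE WHEN N'Á' = N'A' COLLATE Latin1_General_CI_AI THEN 'equal' ELSE 'different' END AS accent_insensitive,
    CASE WHEN N'Á' = N'A' COLLATE Latin1_General_CI_AS THEN 'equal' ELSE 'different' END AS accent_sensitive,
    CASE WHEN N'a' = N'A' COLLATE Latin1_General_CI_AS THEN 'equal' ELSE 'different' END AS case_insensitive,
    CASE WHEN N'a' = N'A' COLLATE Latin1_General_CS_AS THEN 'equal' ELSE 'different' END AS case_sensitive;

-- List the four CI/CS and AI/AS variants with their descriptions.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE 'Latin1_General_C[IS]_A[IS]';
```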
You can read a lot more about it from the source, here.

Collation conflict with temp table

The collation of tempdb is Latin1_General_100_CI_AI. The collation of the database is also Latin1_General_100_CI_AI. Yet the following SQL statement:
SELECT *
FROM ##CitiesMapping AS cm
INNER JOIN Cities ON cm.CityName=Cities.Name
returns:
Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "Latin1_General_100_CI_AI" in the equal to operation.
The server default collation is also Latin1_General_100_CI_AI
It is possible that the collation is set differently for a single column. The query from Stuart will show you that. If they are different collations you can specify the collation being used on either side of the comparison like this:
SELECT *
FROM ##CitiesMapping AS cm
INNER JOIN Cities
ON cm.CityName COLLATE DATABASE_DEFAULT = Cities.Name COLLATE DATABASE_DEFAULT;
I hope this helps you out.
Check the tables involved as well:
SELECT name, collation_name, OBJECT_NAME(object_id)
FROM sys.columns
WHERE OBJECT_NAME(object_id) IN ('Cities')
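The temp table's columns live in tempdb, so the query above will not list them when run from the user database. A sketch for checking that side as well, along with the server and database defaults mentioned in the question (the column aliases are illustrative):

```sql
-- Column collations of the temp table, read from tempdb's catalog.
SELECT name, collation_name
FROM tempdb.sys.columns
WHERE object_id = OBJECT_ID('tempdb..##CitiesMapping');

-- Server, current database and tempdb default collations, for comparison.
SELECT SERVERPROPERTY('Collation')                 AS server_collation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation')  AS database_collation,
       DATABASEPROPERTYEX('tempdb', 'Collation')   AS tempdb_collation;
```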

SQL change field Collation in a select

I'm trying to do the following select:
select * from urlpath where substring(urlpathpath, 3, len(urlpathpath))
not in (select accessuserpassword from accessuser where accessuserparentid = 257)
I get the error:
Cannot resolve the collation conflict between
"SQL_Latin1_General_CP1_CI_AI" and
"SQL_Latin1_General_CP1_CI_AS" in the equal to operation.
Does anyone know how I can cast to a collation, or something else that lets me match this condition?
Thanks
You can add COLLATE CollationName after the column name for the column you want to "re-collate". (Note: the collation name is literal, not quoted)
You can even do the collate on the query to create a new table with that query, for example:
SELECT
*
INTO
#TempTable
FROM
View_total
WHERE
YEAR(ValidFrom) <= 2007
AND YEAR(ValidTo)>= 2007
AND Id_Product = '001'
AND ProductLine COLLATE DATABASE_DEFAULT IN (SELECT Product FROM #TempAUX)
COLLATE DATABASE_DEFAULT makes the COLLATE clause inherit the collation of the current database, eliminating the difference between the two.
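Applied to the query from the question, this might look like the following sketch. Putting COLLATE on both sides is a choice made here for clarity; it is not something the original answer prescribes.

```sql
-- Both sides of the NOT IN comparison are given the database default
-- collation, so there is no conflict left to resolve.
select *
from urlpath
where substring(urlpathpath, 3, len(urlpathpath)) collate DATABASE_DEFAULT
      not in (select accessuserpassword collate DATABASE_DEFAULT
              from accessuser
              where accessuserparentid = 257)
```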

Force T-SQL query to be case sensitive in MS

I have a table that originates in an old legacy system that was case sensitive, in particular a status column where 's' = 'Schedule import' and 'S' = 'Schedule management'. This table eventually makes its way into a SQL Server 2000 database which I can query against. My query is relatively simple, just going for counts...
Select trans_type, count(1) from mytable group by trans_type
This is grouping the counts for 'S' along with the 's' counts. Is there any way to force a query to be case sensitive? I have access to both SQL Server 2000 and 2005 environments to run this, but I have limited admin capability on the server (so I can't set server attributes)... I guess I could move the data to my local machine and set up something there, where I have full access to server options, but I would prefer a T-SQL solution.
select trans_type collate SQL_Latin1_General_CP1_CS_AS, count(*)
from mytable
group by trans_type collate SQL_Latin1_General_CP1_CS_AS
You can do this with =, like, and other operators as well. Note that you must modify the select list because you are no longer grouping by trans_type, you are now grouping by trans_type collate SQL_Latin1_General_CP1_CS_AS. Kind of a gotcha.
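For example, a case-sensitive filter with = could be sketched like this, using the table and column from the question:

```sql
-- Counts only the lowercase 's' rows; 'S' no longer matches because the
-- comparison is done under a case-sensitive collation.
select count(*)
from mytable
where trans_type = 's' collate SQL_Latin1_General_CP1_CS_AS
```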
Can you introduce a trans_type_ascii column with the ascii value of the trans_type and group on that instead? Or any other column you can use (isUpperCase) to distinguish them.
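Without adding a column, the same idea can be tried by grouping on the character code directly. This sketch assumes trans_type holds a single character, since ASCII only looks at the first character of a string; the aliases are illustrative.

```sql
-- 's' has code 115 and 'S' has code 83, so the two statuses end up in
-- separate groups regardless of the column's collation.
select ascii(trans_type) as trans_type_ascii, count(*) as cnt
from mytable
group by ascii(trans_type)
```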
