What are the benefits/drawbacks of using a case insensitive collation in SQL Server (in terms of query performance)?
I have a database that is currently using a case-insensitive collation, and I don't really like it. I would very much like to change it to case sensitive. What should I be aware of when changing the collation?
If you change the collation on the database, you also have to change it on each column individually - they maintain the collation setting that was in force when their table was created.
create database CollTest COLLATE Latin1_General_CI_AI
go
use CollTest
go
create table T1 (
ID int not null,
Val1 varchar(50) not null
)
go
select name,collation_name from sys.columns where name='Val1'
go
alter database CollTest COLLATE Latin1_General_CS_AS
go
select name,collation_name from sys.columns where name='Val1'
go
Result:
name collation_name
---- --------------
Val1 Latin1_General_CI_AI
name collation_name
---- --------------
Val1 Latin1_General_CI_AI
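To actually move the column over, each one has to be altered explicitly. A minimal sketch against the T1 table from the script above; it assumes no index, constraint, or computed column depends on Val1 (otherwise the ALTER is rejected until those are dropped):

```sql
-- Restate the full column definition (type, length, nullability) and
-- add the new collation; omitting NOT NULL would make the column nullable.
ALTER TABLE T1
    ALTER COLUMN Val1 varchar(50) COLLATE Latin1_General_CS_AS NOT NULL;
GO
-- Re-check the metadata; the column should now report the new collation.
SELECT name, collation_name FROM sys.columns WHERE name = 'Val1';
GO
```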
(I added this as a separate answer because it's substantially different from my first.)
OK, found some actual documentation. This MS KB article says that there are performance differences between different collations, but not where you might think. The difference is between SQL collations (backward compatible, but not Unicode aware) and Windows collations (Unicode aware):
Generally, the degree of performance difference between the Windows and the SQL collations will not be significant. The difference only appears if a workload is CPU-bound, rather than being constrained by I/O or by network speed, and most of this CPU burden is caused by the overhead of string manipulation or comparisons performed in SQL Server.
Both SQL and Windows collations have case sensitive and case insensitive versions, so it sounds like that isn't the primary concern.
Another good story "from the trenches" in Dan's excellent article titled "Collation Hell":
I inherited a mixed collation environment with more collations than I can count on one hand. The different collations require workarounds to avoid "cannot resolve collation conflict" errors and those workarounds kill performance due to non-sargable expressions. Dealing with mixed collations is a real pain so I strongly recommend you standardize on a single collation and deviate only after careful forethought.
He concludes:
I personally don't think performance should even be considered in choosing the proper collation. One of the reasons I'm living in collation hell is that my predecessors chose binary collations to eke out every bit of performance for our highly transactional OLTP systems. With the sole exception of a leading wildcard table scan search, I've found no measurable performance difference with our different collations. The real key to performance is query and index tuning rather than collation. If performance is important to you, I recommend you perform a performance test with your actual application queries before you choose a collation based on performance expectations.
Hope this helps.
I would say the biggest drawback to changing to a case sensitive collation in a production database would be that many, if not most, of your queries would fail because they are currently designed to ignore case.
I've not tried to change the collation on an existing database, but I suspect it could be quite time consuming as well. You will probably have to lock your users out completely while the process runs, too. Do not try this unless you have thoroughly tested it on dev.
I can't find anything to confirm whether properly constructed queries work faster on a case-sensitive vs case-insensitive database (although I suspect the difference is negligible), but a few things are clear to me:
If your business requirements don't ask for it, you are signing yourself up for a lot of extra work (this is the crux of both HLGEM's and Damien_The_Unbeliever's answers).
If your business requirements don't ask for it, you are setting yourself up for a lot of possible errors.
It's way too easy to construct poorly performing queries in a case-insensitive database if a case-sensitive lookup is required:
A query like:
... WHERE UPPER(GivenName) = 'PETER'
won't use an index on GivenName. You would think something like:
... WHERE GivenName = 'PETER' COLLATE SQL_Latin1_General_CP1_CS_AS
would work better, and it does. But for maximum performance you'd have to do something like:
... WHERE GivenName = 'PETER' COLLATE SQL_Latin1_General_CP1_CS_AS
AND GivenName LIKE 'PETER'
(see this article for the details)
If you change the database collation but not the server collation (and they then don't match as a result), watch out when using temporary tables. Unless otherwise specified in their CREATE statement, they will use the server's default collation rather than that of the database which may cause JOINs or other comparisons against your DB's columns (assuming they're also changed to the DB's collation, as alluded to by Damien_The_Unbeliever) to fail.
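One way to sidestep this is to declare the collation on the temp table's string columns explicitly. A minimal sketch; #Names and dbo.People are hypothetical names, and it assumes the permanent column already uses the database's default collation:

```sql
-- COLLATE DATABASE_DEFAULT makes the temp table column follow the current
-- database's collation instead of tempdb's (i.e. the server's) collation.
CREATE TABLE #Names (
    GivenName varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);

-- Both sides of the comparison now share one collation, so no
-- "cannot resolve collation conflict" error is raised here.
SELECT p.GivenName
FROM dbo.People AS p
JOIN #Names AS n
    ON n.GivenName = p.GivenName;
```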
I'm using MSSQL 2012.
To handle special characters when searching through LINQ, I found that I can change the COLLATE of the column to *_CI_AI, but before changing it I would like to know what it impacts and where.
This might not be so easy...
If this column takes part in indexes and constraints you will have to drop them, change the collation and recreate them.
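Roughly, the sequence looks like the sketch below. The table, column, and index names are made up for illustration, and the target collation is just one example of a *_CI_AI collation:

```sql
-- 1. Drop whatever depends on the column (here: one hypothetical index).
DROP INDEX IX_Customers_LastName ON dbo.Customers;

-- 2. Change the collation, restating the full column definition.
ALTER TABLE dbo.Customers
    ALTER COLUMN LastName nvarchar(100) COLLATE Latin1_General_CI_AI NOT NULL;

-- 3. Recreate the index; it is rebuilt using the new collation's ordering.
CREATE INDEX IX_Customers_LastName ON dbo.Customers (LastName);
```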
One very painful point with collations is the fact that tempdb uses, by default, the default collation of the server instance. We once had a project where certain statements ran into errors after such a step. This happened when a stored procedure created a #table and used such a column in any kind of comparison (in a WHERE or JOIN predicate).
You can type the collation in any statement manually, so it will be possible to get things working, but the effort might be huge...
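For illustration, specifying the collation by hand in a single predicate could look like this. #Codes and dbo.Products are hypothetical, and the sketch assumes the permanent column uses the database's default collation:

```sql
-- The temp table column silently picks up tempdb's default collation.
CREATE TABLE #Codes (ProductCode varchar(20) NOT NULL);

-- Apply COLLATE to the temp table side so the comparison uses the database
-- collation and the permanent column is left untouched in the predicate.
SELECT p.ProductCode
FROM dbo.Products AS p
JOIN #Codes AS t
    ON p.ProductCode = t.ProductCode COLLATE DATABASE_DEFAULT;
```

Doing this in every affected statement is exactly the effort mentioned above.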
Some related answers:
https://stackoverflow.com/a/39101572/5089204
https://stackoverflow.com/a/35840417/5089204
UPDATE: a list of effects / impacts
Sorting might change (a sorted list could appear in a different order).
Comparisons will be less restrictive with _CI_AI: "Peter" is equal to "peter". Sometimes this is OK (most of the time, actually), but not always (imagine a password). In cases where "Pétè" should be the same as "Pete" this helps...
Joins on string columns might match differently (a problem if ProductCode "aBx5" is not meant to be the same code as "ABx5").
Check constraints might be less restrictive (you force the values "A", "B" or "C" and suddenly you can insert "a", "b" and "c"...).
You might run into collation errors in connection with temp tables (this can be very annoying!). This can break existing code...
With simple text columns this should not be a problem...
Since my SQL Server 2012 instance is using a collation (Latin1_General_CI_AS) different to some DBs in use (SQL_Latin1_General_CP1_CI_AI), I was evaluating possible risks of changing the collation of the databases using a different collation than the SQL Server instance.
I found hundreds of procedures for performing this step. What is not clear to me is whether there are any constraints or risks in performing an action such as this.
Thanks for any replies.
To understand the risk means you need to understand the difference. We can't tell you the impact in your system. The difference is the new collation would start finding matches where it didn't before.
Consider this query using your current collation. This will not return a row because those two values are not equal.
select 1
where 'e' = 'é' collate Latin1_General_CI_AS
Now since both of those characters are the letter 'e' but with different accents they will be equal when you ignore the accent.
select 1
where 'e' = 'é' collate SQL_Latin1_General_CP1_CI_AI
Again there is no way we can tell what the potential problems might be in your system because we don't know your system.
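As a starting point for your own assessment, you can at least list which collations are in play. A small sketch using standard metadata:

```sql
-- Collation of the SQL Server instance (also used by tempdb).
SELECT SERVERPROPERTY('Collation') AS instance_collation;

-- Collation of every database on the instance, for comparison.
SELECT name, collation_name
FROM sys.databases
ORDER BY name;
```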
I'm new to Sybase and I find it really annoying to have to write SQL with the exact case for table names and column names. For example, if the table name is 'Employee', I can't query it as:
select * from employee
Is there a way to change this behavior in Sybase?
I don't want to change the sort order or anything. I'm looking for a hack to bypass this issue.
Cheers!!
As correctly pointed out in the other responses, this is a server-level configuration setting, which can be changed.
However, what is not mentioned is that in ASE, case sensitivity applies equally to identifiers and to data comparisons. So if you configure a case-insensitive sort order as discussed here, the effect will also be that 'Johnson' is now considered equal to 'JOHNSON', and this could potentially cause trouble in applications.
In this sense, ASE is different from other databases where these two aspects of case-sensitivity are decoupled.
This behavior is a result of the server's sort order. This is a server-level setting, not a database-level setting, so the change will affect all databases on the server. Also, if the database is in replication, all connected servers will need their sort order changed as well.
Changing the sort order will also require you to rebuild all the indexes in your system.
Here is the correct documentation on selecting or changing character sets and sort orders.
Configuring Character Sets, Sort Orders, and Languages
As mentioned in the comments, it will require DBA-level access, and the server will have to be restarted before the changes take effect.
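Before changing anything, it may be worth checking what the server is currently configured with. As far as I know, ASE ships a system procedure for this; a minimal sketch:

```sql
-- Reports the server's current default sort order and character set.
exec sp_helpsort
go
```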
We have a MS SQL Server 2005 installation that connects to an Oracle database through a linked server connection.
Lots of SELECT statements are being performed through a series of OPENQUERY() commands. The WHERE clauses in the majority of these statements are against VARCHAR columns.
I've heard that if the WHERE clause is case sensitive, it can have a big impact on performance.
So my question is, how can I make sure that the non-binary string WHERE clauses are being performed in a case insensitive way for maximum performance?
It's actually the other way around:
Case sensitive...
WHERE column = :criteria
...will use index on column directly and perform well.
Case insensitivity typically requires something like this...
WHERE UPPER(column) = UPPER(:criteria)
...which does not use index on column and performs poorly (unless you are careful and create a functional index on UPPER(column)).
I'm not sure whether OPENQUERY() changes anything, but from purely Oracle perspective both case-sensitive and insensitive queries can be made performant, with the insensitive ones requiring special care (functional index).
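A minimal sketch of such a functional (function-based) index on the Oracle side. The employees table and last_name column are hypothetical stand-ins for your own schema:

```sql
-- Index the upper-cased value so the case-insensitive predicate can use it.
CREATE INDEX emp_upper_last_name_idx
    ON employees (UPPER(last_name));

-- The predicate must use the same expression as the index definition.
SELECT *
FROM employees
WHERE UPPER(last_name) = UPPER(:criteria);
```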
By default SQL Server uses a case-insensitive collation, whereas Oracle is case sensitive by default. For searches we normally implement the UPPER() comparison to give the user a better search experience.
I've heard that if the WHERE clause is case sensitive, it can have a big impact on performance.
Where did you hear that? Sounds like a myth to me... if anything, it would be the other way around, i.e. if you used something like WHERE lower(field) = 'some str' to achieve a case-insensitive comparison, it would be bad for performance. Using a case-insensitive collation would probably be significantly faster...
Another important point to consider is whether your business rules actually allow case-insensitive comparison.
And last but not least, you should start to optimize when you actually have a performance problem, not because you heard something...
WHERE LOWER(field_name) = 'field_value'
To make WHERE clause case insensitive, you can use LOWER or UPPER for this purpose.
select * from Table_Name
where lower(Column_Name) = lower('mY Any Value')
OR
select * from Table_Name
where UPPER(Column_Name) = UPPER('mY Any Value')
It seems that, despite the fact that SQL Server does not match on case in a WHERE clause, it still honours UPPER/LOWER in a WHERE clause, which seems to be quite expensive. Is it possible to instruct SQL Server to disregard UPPER/LOWER in a WHERE clause?
This might seem like a pointless question but it's very nice to be able to write a single query for both Oracle and SQL Server.
Thanks, Jamie
The short answer to your question is no: you can't have SQL Server magically ignore function calls in the WHERE clause.
As others have said, the performance issue arises because, on SQL Server, wrapping the column in a function in the WHERE clause prevents an index seek on that column and forces a scan.
To get best performance, you need to maintain two queries, one for each RDBMS platform (either in your application or in database objects like stored procedures or views). Given that so many other areas of functionality differ between Oracle and SQL Server, you're likely to end up doing it anyway, for something else if not for this.
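A rough sketch of what the two variants could look like, reusing the placeholder names from the question. It assumes the SQL Server column uses a case-insensitive collation and that the Oracle side has an index on UPPER(Column_Name); neither is confirmed by the question:

```sql
-- SQL Server variant: with a case-insensitive column collation the plain
-- comparison already ignores case and can seek an index on Column_Name.
SELECT * FROM Table_Name
WHERE Column_Name = 'mY Any Value';

-- Oracle variant: comparisons are case sensitive by default, so both sides
-- are upper-cased; without an index on UPPER(Column_Name) this scans.
SELECT * FROM Table_Name
WHERE UPPER(Column_Name) = UPPER('mY Any Value');
```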
So you mean something like:
WHERE YourColumn = @YourValue collate Latin1_General_BIN
But if you want it to work without the collate keyword, you could just set the collation of the column to something which is case insensitive.
Bear in mind that an index on YourColumn will be using a particular collation, so if you specify the collation in the WHERE clause (rather than on the column itself), an index will be less useful. I liken this to the time I flew to Sweden a few years ago and couldn't find Vasteras on the map, because the letters I thought were "a" actually had accents on them and were sorted at the end of the alphabet. The index in the back of the map wasn't much good when I was trying to use the wrong collation.