Database collation - Change Impact - sql-server

I am really weak as far as databases are concerned so, please bear with me.
I have a database which is in Greek_CI_AI, working without any issue with several applications. All servers that put data into this DB are on the Greek locale.
However, an application treats its information and checks integrity constraints in a case-sensitive manner. I have not run into any issues with that application so far, but I am concerned that I may have to deal with it later, when there is more data and the impact is even bigger. What is the proper way to do this? I mean, do I just change the collation, or should I drop the database and recreate it with the right collation? If I do not have to drop it, how will the change affect the data?
Comparing the two I have not found differences.
http://collation-charts.org/mssql/mssql.0408.1253.Greek_CI_AI.html
http://collation-charts.org/mssql/mssql.0408.1253.Greek_CS_AI.html
Thanks for your help!

You should not change your db collation from Greek_CI_AI to Greek_CS_AI/BIN.
If your application checks integrity constraints in a case-sensitive manner it just means that your business rules require this approach and this case sensitivity is implemented directly in those constraints.
If you change the database collation to Greek_CS_AI you may simply break application code. If there are tables Table1 and Table2 in your database now, all the code can reference them as table1 and table2, but once your db collation becomes case-sensitive, the objects table1 and table2 will not be found.
Also, what is the difference between Greek_CI_AI and Greek_BIN?
To see this for yourself, run some SELECTs against your data, adding ORDER BY col1 COLLATE Greek_CS_AI (or COLLATE Greek_BIN) to the statement.
In the first case, the uppercase and lowercase forms of a letter are placed next to each other, with lowercase always preceding uppercase within the same letter. In the second (BIN) case, ALL the uppercase letters precede ALL the lowercase letters.
This is because BIN collations compare characters based on their code values rather than on linguistic rules.
Note that the older BIN collations have a flaw with Unicode data: only the first character is compared as a code point, and the rest of the string is compared byte by byte. For this reason, if you ever need a binary collation, always use a BIN2 collation, which does not have this problem.
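A minimal sketch of that comparison, assuming a hypothetical temporary table #Letters with a col1 column (neither comes from the question). It uses Greek_BIN2 rather than Greek_BIN, in line with the note about BIN collations:

```sql
-- Hypothetical sample data: the table and column names are illustrative only.
CREATE TABLE #Letters (col1 nvarchar(10) NOT NULL);
INSERT INTO #Letters (col1) VALUES (N'b'), (N'A'), (N'a'), (N'B');

-- Linguistic, case-sensitive ordering: both cases of a letter stay together.
SELECT col1 FROM #Letters ORDER BY col1 COLLATE Greek_CS_AI;

-- Binary ordering by code point: all uppercase letters sort before all lowercase ones.
SELECT col1 FROM #Letters ORDER BY col1 COLLATE Greek_BIN2;

DROP TABLE #Letters;
```

Because COLLATE is applied only in the ORDER BY clause, this lets you preview the effect without changing the collation of the database or of any column.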

Related

SQL Server 2012- Server collation and database collation

I have SQL Server 2012 installed that is used for a few different applications. One of our applications needs to be installed, but the company is saying that:
The SQL collation isn't correct, it needs to be: SQL_Latin1_General_CP1_CI_AS
You can just uninstall the SQL Server Database Engine & upon reinstall select the right collation.
What possible reason would this company have to want to change the collation of the database engine itself?
Yes, you are able to set the collation at the database level. To do so, here is an example:
USE master;
GO
ALTER DATABASE <DatabaseName>
COLLATE SQL_Latin1_General_CP1_CI_AS;
GO
You can alter the database Collation even after you have created the database using the following query
USE master;
GO
ALTER DATABASE Database_Name
COLLATE Your_New_Collation;
GO
For more information on database collation, read here.
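Before changing anything, it may help to see which collations are currently in effect. A small sketch using the built-in SERVERPROPERTY and DATABASEPROPERTYEX functions; the column aliases are just illustrative:

```sql
-- Shows the instance-level collation and the collation of the current database.
SELECT SERVERPROPERTY('Collation')                AS InstanceCollation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;
```

If the two values differ, the ALTER DATABASE statement above only affects the second one.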
What possible reason would this company have to want to change the collation of the database engine itself?
The other two answers are speaking in terms of Database-level Collation, not Instance-level Collation (i.e. "database engine itself"). The most likely reason that the vendor has for wanting a highly specific Collation (not just a case-insensitive one of your choosing, for example) is that, like most folks, they don't really understand how Collations work, but what they do know is that their application works (i.e. does not get Collation conflict errors) when the Instance and Database both have a Collation of SQL_Latin1_General_CP1_CI_AS, which is the Collation of their Instance and Database (that they develop the app on), because that is the default Collation when installing on an OS having English as its language.
I'm guessing that they have probably had some customers report problems that they didn't know how to fix, but narrowed it down to those Instances not having SQL_Latin1_General_CP1_CI_AS as the Instance / Server -level Collation. The Instance-level Collation controls not just tempdb meta-data (and default column Collation when no COLLATE keyword is specified when creating local or global temporary tables), which has been mentioned by others, but also name resolution for variables / parameters, cursors, and GOTO labels. Even if unlikely that they would be using GOTO statements, they are certainly using variables / parameters, and likely enough to be using cursors.
What this means is that they likely had problems in one or more of the following areas:
Collation conflict errors related to temporary tables:
tempdb being in the Collation of the Instance does not always mean that there will be problems, even if the COLLATE keyword was never used in a CREATE TABLE #[#]... statement. Collation conflicts only occur when attempting to combine or compare two string columns. So assuming that they created a temporary table and used it in conjunction with a table in their Database, they would need to be JOINing on those string columns, or concatenating them, or combining them via UNION, or something along those lines. Under these circumstances, an error will occur if the Collations of the two columns are not identical.
Unexpected behavior:
Comparing a string column of a table to a variable or parameter will use the Collation of the column. Given their requirement for you to use SQL_Latin1_General_CP1_CI_AS, this vendor is clearly expecting case-insensitive comparisons. Since string columns of temp tables (that were not created using the COLLATE keyword) take on the Collation of the Instance, if the Instance is using a binary or case-sensitive Collation, then their application will not be returning all of the data that they were expecting it to return.
Code compilation errors:
Since the Instance-level Collation controls resolution of variable / parameter / cursor names, if they have inconsistent casing in any of their variable / parameter / cursor names, then errors will occur when attempting to execute the code. For example, doing this:
DECLARE @CustomerID INT;
SET @customerid = 5;
would get the following error:
Msg 137, Level 15, State 1, Line XXXXX
Must declare the scalar variable "@customerid".
Similarly, they would get:
Msg 16916, Level 16, State 1, Line XXXXX
A cursor with the name 'Customers' does not exist.
if they did this:
DECLARE customers CURSOR FOR SELECT 1 AS [Bob];
OPEN Customers;
These problems are easy enough to avoid, simply by doing the following:
Specify the COLLATE keyword on string columns when creating temporary tables (local or global). Using COLLATE DATABASE_DEFAULT is handy if the Database itself is not guaranteed to have a particular Collation. But if the Collation of the Database is always the same, then you can specify either DATABASE_DEFAULT or the particular Collation. Though I suppose DATABASE_DEFAULT works in both cases, so maybe it's the easier choice.
Be consistent in casing of identifiers, especially variables / parameters. And to be more complete, I should mention that Instance-level meta-data is also affected by the Instance-level Collation (e.g. names of Logins, Databases, server-Roles, SQL Agent Jobs, SQL Agent Job Steps, etc). So being consistent with casing in all areas is the safest bet.
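As a rough sketch of the first fix, the following creates a temporary table with one string column left at the default and one declared with COLLATE DATABASE_DEFAULT, then reads back the collation each column received. The table and column names are hypothetical, and it assumes an ordinary (non-contained) user database is the current database:

```sql
CREATE TABLE #CollationDemo
(
    -- No COLLATE clause: the column takes the collation of tempdb.
    NameInstanceDefault nvarchar(50) NULL,
    -- COLLATE DATABASE_DEFAULT: the column takes the collation of the current database.
    NameDatabaseDefault nvarchar(50) COLLATE DATABASE_DEFAULT NULL
);

-- Temporary table meta-data lives in tempdb, so look the columns up there.
SELECT c.name, c.collation_name
FROM tempdb.sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'tempdb..#CollationDemo');

DROP TABLE #CollationDemo;
```

When the Instance and Database Collations differ, the two rows should show different collation names, and only the second column can be joined to the Database's own string columns without a conflict error.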
Am I being unfair in assuming that the vendor doesn't understand how Collations work? Well, according to a comment made by the O.P. on M.Ali's answer:
I got this reply from him: "It's the other way around, you need the new SQL instance collation to match the old SQL collation when attaching databases to it. The collation is used in the functioning of the database, not just something that gets set when it's created."
the answer is "no". There are two problems here:
No, the Collations of the source and destination Instances do not need to match when attaching a Database to a new Instance. In fact, you can even attach a system DB to an Instance that has a different Collation, thereby having a mismatch between the attached system DB and the Instance and the other system DBs.
It's unclear if "database" in that last sentence means actual Database or the Instance (sometimes people use the term "database" to refer to the RDBMS as a whole). If it means actual "Database", then that is entirely irrelevant because the issue at hand is the Instance-level Collation. But, if the vendor meant the Instance, then while true that the Collation is used in normal operations (as noted above), this only shows awareness of simple cause-effect relationship and not actual understanding. Actual understanding would lead to doing those simple fixes (noted above) such that the Instance-level Collation was a non-issue.
If needing to change the Collation of the Instance, please see:
Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
For more info on working with Collations / encodings / Unicode / etc, please visit:
Collations.Info

Difference between SQL_Latin1_General_Cp437_CI_AS_KI_WI and SQL_Latin1_General_Cp850_CI_AS_KI_WI

I was reviewing the article titled Selecting a SQL Server Collation, trying to decide which one to use for a database, and I noticed two that seemed identical. Is there some sort of difference between these two that isn't listed on the page?
SQL_Latin1_General_Cp437_CI_AS_KI_WI
SQL sort order ID = 32
Sort order name = nocase.437
Description = Dictionary order, case-insensitive
SQL_Latin1_General_Cp850_CI_AS_KI_WI
SQL sort order ID = 42
Sort order name = nocase.850
Description = Dictionary order, case-insensitive
The numbers in Cp437 and Cp850 refer to code pages, and using the wrong code page could result in some curious results! I would highly recommend that you use a standard collation like SQL_Latin1_General_CP1_CI_AS (or _AI for accent insensitivity.) Picking a collation however is a tricky affair, and if you have lots of Unicode data, using a SQL collation can cause performance issues (as reported by some) as your indexes will not cover Unicode characters in nvarchar fields, or can cause unusual sort orders when mixed Unicode and non-Unicode data is present. See Collation Types for more information.
I would recommend that you either stick with the SQL Server default, which I listed above, or use a Windows collation based on careful selection. You will notice that English (United States) is actually the default; I'm unsure whether that collation was made into a Unicode-supporting collation.
Other pages to look into are the Wikipedia article on ASCII and Extended ASCII which explain the code pages and their history.
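To see which code page a given collation uses, one option is to combine the built-in sys.fn_helpcollations() and COLLATIONPROPERTY functions. This is only a sketch: the filter on code pages 437 and 850 is my own choice to match the two collations in the question.

```sql
-- Lists the SQL_Latin1_General collations that use code page 437 or 850.
SELECT hc.name,
       CAST(COLLATIONPROPERTY(hc.name, 'CodePage') AS int) AS CodePage,
       hc.description
FROM sys.fn_helpcollations() AS hc
WHERE hc.name LIKE N'SQL[_]Latin1[_]General[_]%'
  AND CAST(COLLATIONPROPERTY(hc.name, 'CodePage') AS int) IN (437, 850);
```

The code page only matters for non-Unicode (char / varchar) data; nvarchar data is stored as Unicode regardless of it.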

SQL Server Column names case sensitivity

The DB I use has French_CI_AS collation (CI should stand for Case-Insensitive) but is case-sensitive anyway. I'm trying to understand why.
The reason I assert this is that bulk inserts with a 'GIVEN' case setup fail, but they succeed with another 'Given' case setup.
For example:
INSERT INTO SomeTable([GIVEN],[COLNAME]) VALUES ('value1', 'value2') fails, but
INSERT INTO SomeTable([Given],[ColName]) VALUES ('value1', 'value2') works.
EDIT
Just saw this:
http://msdn.microsoft.com/en-us/library/ms190920.aspx
so that means it should be possible to change a column's collation without emptying all the data and recreating the related table?
Given this critical piece of information (that is in a comment on the question and not in the actual question):
In fact I use Microsoft .Net's bulk insert method, so I don't really know the exact query it sends to the DB server.
it makes sense that the column names are being treated as case-sensitive, even in a case-insensitive DB, since that is how the SqlBulkCopy Class works. Please see Column mappings in SqlBulkCopy are case sensitive.
ADDITIONAL NOTES
When asking about an error, please always include the actual, and full, error message in the question. Simply saying that there was an error leads to a lot of guessing and wild-goose chases that in turn lead to off-topic answers.
When asking a question, please do not change the circumstances that you are dealing with. For example, the question states (emphasis added):
bulk inserts with a 'GIVEN' case setup fail, but they succeed with another 'Given' case setup.
Yet the example statements are single INSERTs. Also, a comment on the question states:
In fact I use Microsoft .Net's bulk insert method, so I don't really know the exact query it sends to the DB server.
Using .NET and SqlBulkCopy is waaaay different than using BULK INSERT or INSERT, making the current question misleading, making it difficult (or even impossible) to answer correctly. This new bit of info also leads to more questions because when using SqlBulkCopy, you don't write any INSERT statements: you just write a SELECT statement and specify the name of the destination Table. If you specify column names at all for the destination Table, it is in the optional column mappings. Is that where the issue is?
Regarding the "EDIT" section of the question:
No, changing the Collation of the column won't help at all, even if you weren't using SqlBulkCopy. The Collation of a column determines how data stored in the column behaves, not how the column names (i.e. meta-data of the Table) behaves. It is the Collation of the Database itself that determines how Database-level object meta-data behaves. And in this case, you claim that the DB is using a case-insensitive Collation (correct, the _CI_ portion of the Collation name does mean "Case Insensitive").
Regarding the following statements made by Jonathan Leffler on the question:
that gets into a very delicate area of the interaction between delimited identifiers (normally case-sensitive) and collations (this one is case-insensitive).
No, delimited identifiers are not normally case-sensitive. The sensitivities (case, accent, kana type, width, and, starting in SQL Server 2017, variation selector) of delimited identifiers are the same as for non-delimited identifiers at that same level. "Same level" means that Instance-level names (Databases, Logins, etc) are controlled by the Instance-level Collation, while Database-level names (Schemas, Objects--Tables, Views, Functions, Stored Procedures, etc--, Users, etc) are controlled by the Database-level Collation. And these two levels can have different Collations.
you need to research whether the SQL column names in a database are case-sensitive when delimited. It may also depend on how the CREATE TABLE statement is written (were the names delimited in that?). Normally, SQL is case-insensitive on column and table names; you could write INSERT INTO SoMeTaBlE(GiVeN, cOlNaMe) VALUES("v1", "v2") and if the names were never delimited, it'd be OK.
It does not matter if the column names were delimited or not when creating the Table, at least not in terms of how their resolution is handled. Column names are Database-level meta-data, and that is controlled by the default Collation of the Database. And it is the same for all Database-level meta-data within each Database. You cannot have some column names being case-sensitive while others are case-insensitive.
Also, there is nothing special about Table and column names. They are Database-level meta-data just like User names, Schema names, Index names, etc. All of this meta-data is controlled by the Database's default Collation.
Meta-data (both Instance-level and Database-level) is only "normally" case-insensitive due to the default Collation suggested during installation being a case-insensitive Collation.
a 'delimited identifier' is a column name, table name, or something similar enclosed in double quotes, such as CREATE TABLE "table"(...)
It is more accurate to say that a delimited identifier is an identifier enclosed in whatever character(s) the DBMS in question has defined as its delimiters. And which particular characters are used for delimiters varies between the different DBMSs.
In SQL Server, delimited identifiers are enclosed in square brackets: [GIVEN]
While square brackets always work as delimiters for identifiers, it is possible to use double-quotes as delimiters IF you have the session-level property of QUOTED_IDENTIFIER set to ON (which is best to always do anyway).
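A short sketch of both delimiter styles side by side, using a hypothetical temporary table (the column names are borrowed from the question purely for illustration):

```sql
SET QUOTED_IDENTIFIER ON;
GO
-- Square brackets always work; double-quotes work because QUOTED_IDENTIFIER is ON.
CREATE TABLE #DelimiterDemo ([GIVEN] int NULL, "COLNAME" int NULL);

SELECT [GIVEN], "COLNAME" FROM #DelimiterDemo;

DROP TABLE #DelimiterDemo;
GO
```

With QUOTED_IDENTIFIER set to OFF, "COLNAME" would instead be parsed as a string literal, which is why leaving the setting ON is the safer habit.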
There are arcane parts to SQL (and delimited identifier handling is one of them)
Well, delimited identifiers are actually quite simple. The whole point of delimiting an identifier is to effectively ignore the rules of regular (i.e. non-delimited) identifiers. But, in terms of regular identifiers, yes, those rules are rather arcane (mainly due to the official documentation being incomplete and incorrect). So, in order to take the mystery out of how identifiers in SQL Server actually work, I did a bunch of research and published the results here (which includes links to the research itself):
Completely Complete List of Rules for T-SQL Identifiers
For more info on Collations / Encodings / Unicode / ASCII, especially as they relate to Microsoft SQL Server, please visit:
Collations.Info
The fact the column names are case sensitive means that the MASTER database has been created using a case sensitive collation.
In the case I just had that led me to investigate this, someone entered Latin1_CS_AI instead of Latin1_CI_AS when setting up SQL Server.
Check the collation of the columns in your table definition, and the collation of the tempdb database (i.e. the server collation). They may differ from your database collation.
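One way to run those checks, assuming the table from the question lives in the dbo schema (the schema name is my assumption):

```sql
-- Collation of each string column in the table (NULL for non-string columns).
SELECT c.name AS ColumnName, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.SomeTable');

-- Collations of the current database, of tempdb, and of the instance.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation,
       DATABASEPROPERTYEX(N'tempdb', 'Collation') AS TempdbCollation,
       SERVERPROPERTY('Collation')                AS InstanceCollation;
```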

in sql server, what is: Latin1_General_CI_AI versus Latin1_General_CI_AS

I am using SQL Compare to compare two versions of a database. It keeps highlighting differences in the nvarchar fields, where it shows one db that has:
Latin1_General_CI_AS
and the other one has this:
Latin1_General_CI_AI
Can someone please explain what this is and whether I should be worried about this difference?
Accent Sensitive and Accent Insensitive
Lòpez and Lopez are the same if Accent Insensitive.
What the comparison is showing is that the two columns have different collations. The collation of a (text) field affects how it is both stored and compared.
The particular difference in your case is that, with the _AI collation, accents on characters will be ignored when comparisons and sorting are done.
When you install SQL Server, you set a default collation for the whole server. You can also set a collation per database and per column, meaning that you can mix them within a database (whether you want to depends on your particular case). The MSDN page I linked to has more information on collations, how to choose the best one, and how to set them.
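The effect is easy to see with two literals and an explicit COLLATE clause. This is just an illustrative sketch; the column aliases are made up:

```sql
-- Compares the same pair of strings under an accent-insensitive
-- and an accent-sensitive collation.
SELECT CASE WHEN N'Lòpez' = N'Lopez' COLLATE Latin1_General_CI_AI
            THEN 'equal' ELSE 'different' END AS AccentInsensitive,  -- 'equal'
       CASE WHEN N'Lòpez' = N'Lopez' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'different' END AS AccentSensitive;    -- 'different'
```

So whether the difference matters depends on whether queries against those nvarchar columns are expected to treat accented and unaccented spellings as the same value.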

Sql Server 2008 - Difference between collation types

I'm installing a new SQL Server 2008 server and am having some problems finding any usable information regarding the different collations. I have searched SQL Server BOL and googled for an answer but can't seem to find anything usable.
1. What is the difference between the Windows collations "Finnish_Swedish_100" and "Finnish_Swedish"? I suppose that the "_100" version is an updated collation in SQL Server 2008, but what has changed from the older version if that is the case?
2. Is it usually a good thing to have "Accent-sensitive" enabled? I know that it depends on the task and all that, but are there any well-known pros and cons to consider?
3. The "Binary" and "Binary-code point" parameters: in which cases should these be enabled?
The _100 indicates a collation sequence new in SQL Server 2008, those with _90 are for 2005 and those with no suffix are 2000. I don't know what the differences are, and can't find any documentation. Unless you are doing linked server queries to another SQL server of a different version, I'd be tempted to go with the _100 one. Sorry I can't help with the differences.
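To at least see which variants exist on a given instance, the built-in sys.fn_helpcollations() function can be filtered by name. A small sketch:

```sql
-- Lists every Finnish_Swedish collation available on this instance,
-- including the _100 versions where present, with their descriptions.
SELECT hc.name, hc.description
FROM sys.fn_helpcollations() AS hc
WHERE hc.name LIKE N'Finnish[_]Swedish%'
ORDER BY hc.name;
```

The descriptions spell out the sensitivity options of each collation, though not the linguistic changes between versions.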
The letters ÅÄÖ/åäö do not mix up with A and O just by setting the collation to AI (Accent Insensitive). That is however true for â and other "combinations" not part of the Swedish alphabet as individual letters. â will mix or not mix depending of the setting in question.
Since I have a lot of old databases I still need to communicate with, also using linked servers, I chose Finnish_Swedish_CI_AS now that I'm installing SQL2008. That was the default setting for Finnish_Swedish when the Windows collations first appeared in SQL Server.
Use the query below to try it out yourself.
As you can see, å, ä, etc. do not count as accented characters, and are sorted according to the Swedish alphabet when using the Finnish/Swedish collation.
However, the accents are only considered if you use the AS collation. For the AI collation, their order is unchanged, as if there was no accent at all.
CREATE TABLE #Test (
Number int identity,
Value nvarchar(20) NOT NULL
);
GO
-- The N prefix keeps the literals as Unicode, regardless of the database code page
INSERT INTO #Test VALUES (N'àá');
INSERT INTO #Test VALUES (N'áa');
INSERT INTO #Test VALUES (N'aa');
INSERT INTO #Test VALUES (N'aà');
INSERT INTO #Test VALUES (N'áb');
INSERT INTO #Test VALUES (N'ab');
-- w is considered an accented version of v
INSERT INTO #Test VALUES (N'wa');
INSERT INTO #Test VALUES (N'va');
INSERT INTO #Test VALUES (N'zz');
INSERT INTO #Test VALUES (N'åä');
GO
SELECT Number, Value FROM #Test ORDER BY Value COLLATE Finnish_Swedish_CI_AS;
SELECT Number, Value FROM #Test ORDER BY Value COLLATE Finnish_Swedish_CI_AI;
GO
DROP TABLE #Test;
GO
To address question 3 (info taken off the MSDN; wording theirs, format mine):
Binary (_BIN):
Sorts and compares data in SQL Server tables based on the bit patterns defined for each character.
Binary sort order is case-sensitive and accent-sensitive.
Binary is also the fastest sorting order.
If this option is not selected, SQL Server follows sorting and comparison rules as defined in dictionaries for the associated language or alphabet.
Binary-code point (_BIN2):
For Unicode data: Sorts and compares data in SQL Server tables based on Unicode code points.
For non-Unicode data: will use comparisons identical to binary sorts.
The advantage of using a Binary-code point sort order is that no data resorting is required in applications that compare sorted SQL Server data. As a result, a Binary-code point sort order provides simpler application development and possible performance increases.
For more information, see Guidelines for Using BIN and BIN2 Collations.
To address your question 2: accent sensitivity is a good thing to have enabled for Finnish-Swedish. Otherwise your "å"s and "ä"s will be sorted as "a"s and "ö"s as "o"s (assuming you will be using those kinds of international characters).
More here: http://msdn.microsoft.com/en-us/library/ms143515.aspx (discusses both binary codepoint and accent sensitivity)
On Questions 2 and 3
Accent Sensitivity is something I would suggest turning OFF if you are accepting user data, and ON if you have clean, sanitized data.
Not being Finnish myself, I don't know how many words there are that are different depending on the ó ô õ or ö that they have in them. But if there are users entering data, you can be sure that they will NOT be consistent in their usage, and you want to be able to match them.
If you are gathering data from a dataset that you know the content of, and know the consistency of, then you will want to turn Accent Sensitivity ON because you know that the differences are purposeful.
The same questions apply when considering Question 3. (I'm mostly getting this from the link Tomalak provided) If the data is case and accent sensitive, then you want _BIN, because it will sort faster. If the data is irregular, and not case/accent sensitive, then you will want _BIN2, because it is designed for Unicode data.
To address question 2:
Yes, if accents are required grammar for the given language.
