SQL Server Collation / ADO.NET DataTable.Locale with different languages

We have a WinForms app that stores data in SQL Server (2000; we are working on porting it to 2008) through ADO.NET (1.1; we are working on porting to 4.0). Everything works fine if I read data previously written in a Western European locale (e.g. "test", "test ù"), but now we have to be able to mix Western and non-Western alphabets as well (e.g. "test - ۓےۑ"; these are just random Arabic characters).
On the SQL Server side, the database has been set up with the Latin1_General collation, and the field is an nvarchar(80). If I run a SQL SELECT statement (e.g. "SELECT * FROM MyTable WHERE field = 'test - ۓےۑ'"; never mind the "*" or the actual names) from Query Analyzer, I get no results; the same happens if I pass the SQL statement to an ADO.NET DataAdapter to fill a DataTable. My guess is that it has something to do with collation, but I don't know how to correct this: do I have to change the collation (SQL Server) to a different one? Or do I have to set the locale on the DataAdapter/DataTable (ADO.NET)?
Thanks in advance to anyone who will help

Shouldn't you use the N prefix when comparing an nvarchar column against a literal that contains extended characters?
SELECT * From TestTable WHERE GreekColCaseInsensitive = N'test - ۓےۑ'

Yes, the problem is most likely the collation. The Latin1_General collation does not include the rules to sort and compare non-Latin characters.
MSDN claims:
If you must store character data that reflects multiple languages, you can minimize collation compatibility issues by always using the Unicode nchar, nvarchar, and ntext data types instead of the char, varchar, text data types. Using the Unicode data types eliminates code page conversion issues.
Since you have already complied with this, you should read further on the info about Mixed Collation Environments here.
Additionally, I want to add that changing a collation is not easily done; check the MSDN documentation for SQL Server 2000:
When you set up SQL Server 2000, it is important to use the correct collation settings. You can change collation settings after running Setup, but you must rebuild the databases and reload the data. It is recommended that you develop a standard within your organization for these options. Many server-to-server activities can fail if the collation settings are not consistent across servers.
You can, however, specify a collation on a per-column basis:
CREATE TABLE TestTable (
id int,
GreekColCaseInsensitive nvarchar(10) collate greek_ci_as,
LatinColCaseSensitive nvarchar(10) collate latin1_general_cs_as
)
Have a look at the different binary multilingual collations here. Depending on the charset you use, you should find one that fits your purpose.
If you are not able or willing to change the collation of a column you can also just specify the collation to be used in the query like:
SELECT * From TestTable
WHERE GreekColCaseInsensitive = N'test - ۓےۑ'
COLLATE latin1_general_cs_as
As jfrobishow pointed out the use of N in front of the string you want to use to compare is essential. What does it do:
It denotes that the subsequent string is in Unicode (the N actually stands for National language character set). Which means that you are passing an NCHAR, NVARCHAR or NTEXT value, as opposed to CHAR, VARCHAR or TEXT. See Article #2354 for a comparison of these data types.
You can find a quick rundown here.
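To see why a literal without the N prefix cannot match, here is a small Python sketch of roughly what happens when non-Latin text is forced through a Latin code page. It assumes the database collation maps to Windows code page 1252, and Python's errors="replace" handler only approximates SQL Server's conversion (which may also apply best-fit mappings), so treat it as an illustration rather than the server's exact behaviour.

```python
# Sketch: what a non-N literal roughly turns into under code page 1252.
# Assumption: the database collation maps to Windows-1252.
text = "test - \u06d3\u06d2\u06d1"  # three Arabic-script letters, as in the question

# Characters with no cp1252 equivalent are replaced by '?'.
as_varchar = text.encode("cp1252", errors="replace").decode("cp1252")

print(as_varchar)          # test - ???
print(as_varchar == text)  # False: the comparison can no longer match
```

With the N prefix the literal stays Unicode, so no such conversion takes place before the comparison.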

Related

SQL Server 2019 CHARINDEX returns weird result

When I run the following query in SQL Server 2019, the result is 1, whereas it should be 0.
select CHARINDEX('αρ', 'αυρ')
What could be the problem?

As was mentioned in the comments it may be because you have not declared your string literals as Unicode strings but are using Unicode characters in the strings. SQL Server will be converting the strings to another codepage and doing a bad job of it. Try running this query to see the difference.
SELECT 'αρ', 'αυρ', N'αρ', N'αυρ'
On my server, this gives the following output:
a? a?? αρ αυρ
Another issue is that CHARINDEX uses the collation of its input, which I think is probably not set correctly in this instance. You can force a collation by setting it on one of the inputs. It is also possible to set it at the instance, database, and column level.
There are different collations that may be applicable. They have different features: for example, some are case-sensitive and some are not. Also, not all collations are installed with every SQL Server instance. It would be worth running SELECT * from sys.fn_helpcollations() to see the descriptions of all the installed ones.
If you change your query to this you should get the result you are looking for.
SELECT CHARINDEX(N'αρ' COLLATE Greek_BIN, N'αυρ')
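The unexpected result of 1 can be reproduced outside the database. The sketch below uses a hypothetical helper, charindex, that mimics only the 1-based "position or 0" contract of CHARINDEX; it ignores collation rules entirely. The first call uses the strings as they appeared in the output above after the lossy conversion, the second uses the intact Unicode strings.

```python
def charindex(needle: str, haystack: str) -> int:
    """1-based position of needle in haystack, 0 if absent (like CHARINDEX)."""
    return haystack.find(needle) + 1

# Strings as the server saw them after the lossy conversion shown above.
print(charindex("a?", "a??"))  # 1

# The same search on the intact Unicode strings.
print(charindex("\u03b1\u03c1", "\u03b1\u03c5\u03c1"))  # 0
```

Once both Greek strings have degraded to "a?" and "a??", the first is a prefix of the second, which is why the server reports a match at position 1.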

Varchar vs nvarchar - causing distinct values that we don't consider distinct

SQL Server 2019 - we have a column called Entity which is of type nvarchar(max). The data from this column is inserted from tables on the web as part of an automated process.
In querying for DISTINCT values in this column, we only expected one distinct value, but we actually were returned two. But the two values looked exactly the same inside SQL Server Management Studio.
So we added a CONVERT(varchar(max)) to the query in a new column, and we were able to see the difference, as follows:
Entity        Converted
------------  -------------
Security Law  Security Law
Security Law  Security ?Law
Does anyone know how or why this different value is occurring, and more importantly, how we can instruct SQL Server to treat these as duplicate values, by only analyzing the nvarchar version?

nvarchar stores Unicode characters. Since the data is copied from the web, the second value may contain invisible characters that varchar cannot represent, which is why the conversion shows a "?".
You can use a regex to extract only the ASCII characters and convert the result to varchar so that the values compare as equal.
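If the automated import process can be touched, one option is to inspect and clean the values before they reach the database. The following Python sketch is only an illustration: the zero-width space in the sample is a guess at what the hidden character might be (the question does not say), and both helper names are made up for this example.

```python
import unicodedata

def describe_hidden(text: str) -> list:
    """Describe every character that is not printable ASCII."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in text
        if not (" " <= ch <= "~")
    ]

def strip_format_chars(text: str) -> str:
    """Drop Unicode format characters (category Cf), e.g. zero-width spaces."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

plain = "Security Law"
suspect = "Security \u200bLaw"  # hypothetical: zero-width space before "Law"

print(describe_hidden(suspect))              # ['U+200B ZERO WIDTH SPACE']
print(strip_format_chars(suspect) == plain)  # True
```

Stripping only format characters keeps legitimate non-ASCII text intact, which an ASCII-only filter would not.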

Strings used in query always sent with NVARCHAR syntax, even if the underlying column is not unicode

I'm noticing some odd behavior in the SQL generated for queries against string fields in MS SQL.
Server version: SQL Server 2014 12.0.5000.0
Collation: SQL_Latin1_General_CP1_CI_AS
Python version: 3.7
Our database has a mix of NVARCHAR (mostly newer) and VARCHAR (mostly older) fields. We are using SQLAlchemy to connect our Python application to the database, and even though we specify that a column is of type String (as opposed to Unicode), the executed SQL always comes out in NVARCHAR syntax (for example, N'foo').
This ends up creating some obvious problems, as a simple index lookup on a multi-million row table turns into a giant string re-encoding operation.
The workaround I discovered is to pass in bytestrings (a la s.encode("utf-8")) instead of strs, but this is incredibly error-prone and hackish. I expected SQLAlchemy to handle this automatically since I told it that I'm querying against a String column and not a Unicode column.
If this is supposed to happen automatically, then maybe it's because it doesn't know the database collation? If so, how would I go about setting this?
Finally, as another point of reference, we're using pymssql. I am aware, through previous experience before we were using SQLAlchemy, that pymssql does the same thing (it assumes unicode strings are NVARCHAR while bytestrings are not). Code here. As far as I can tell, SQLAlchemy just passes this off down the line. This behavior is a bit surprising to me since SQLAlchemy knows the column types and the type of connection/driver it's working with.
I'm not afraid to get my hands dirty, so if anyone happens to know where this could be reasonably patched, I'd be happy to contribute. My current investigation seems to indicate something to do with dialects and/or query/statement compilation.
I've uploaded a minimal example project to GitHub.
EDIT 2019-03-18: Updated with new information based on investigation.
EDIT 2019-03-23: Added GitHub repo with minimal example.

I was able to reproduce the issue. Your MCVE was very helpful.
It was interesting to see that, for your ORM example, SQL Profiler showed no evidence that SQLAlchemy was retrieving the column metadata before running the SELECT query against the table. Apparently it believes that it knows enough about the columns to construct a working query, even though (as it turns out) it is not necessarily the most efficient one.
I knew that SQLAlchemy's SQL Expression Language would retrieve the table metadata, so I tried a similar SELECT using
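from sqlalchemy import MetaData, Table, select

# engine and value are assumed to be defined as in the question's example project
metadata = MetaData()
# reflect the column definitions from the database
my_table = Table('test', metadata, autoload=True, autoload_with=engine)
stmt = select([my_table.c.id, my_table.c.key])\
    .select_from(my_table)\
    .where(my_table.c.key == value)
cnxn = engine.connect()
items = cnxn.execute(stmt).fetchall()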
and although SQLAlchemy did indeed retrieve the metadata using
SELECT [INFORMATION_SCHEMA].[columns].[table_schema],
[INFORMATION_SCHEMA].[columns].[table_name],
[INFORMATION_SCHEMA].[columns].[column_name],
[INFORMATION_SCHEMA].[columns].[is_nullable],
[INFORMATION_SCHEMA].[columns].[data_type],
[INFORMATION_SCHEMA].[columns].[ordinal_position],
[INFORMATION_SCHEMA].[columns].[character_maximum_length],
[INFORMATION_SCHEMA].[columns].[numeric_precision],
[INFORMATION_SCHEMA].[columns].[numeric_scale],
[INFORMATION_SCHEMA].[columns].[column_default],
[INFORMATION_SCHEMA].[columns].[collation_name]
FROM [INFORMATION_SCHEMA].[columns]
WHERE [INFORMATION_SCHEMA].[columns].[table_name] = Cast(
N'test' AS NVARCHAR(max))
AND [INFORMATION_SCHEMA].[columns].[table_schema] = Cast(
N'dbo' AS NVARCHAR(max))
ORDER BY [INFORMATION_SCHEMA].[columns].[ordinal_position]
a portion of whose output is
TABLE_SCHEMA TABLE_NAME COLUMN_NAME IS_NULLABLE DATA_TYPE ORDINAL_POSITION CHARACTER_MAXIMUM_LENGTH
------------ ---------- ----------- ----------- --------- ---------------- ------------------------
dbo          test       id          NO          int       1                NULL
dbo          test       key         NO          varchar   2                50
the resulting SELECT query still used an nvarchar literal
SELECT test.id, test.[key]
FROM test
WHERE test.[key] = N'record123456'
Finally, I did the same tests using pyodbc instead of pymssql and the results were essentially the same. I was curious if SQLAlchemy's dialect for pyodbc might take advantage of setinputsizes to specify the parameter types (i.e., pyodbc.SQL_VARCHAR instead of pyodbc.SQL_WVARCHAR), but apparently it does not.
So, I'd say that for the time being your best bet is to continue encoding your string values into bytes that correspond to the character set of the varchar column you are querying (not utf-8). Of course, you can also dive into source code for the SQLAlchemy dialect(s) and submit a PR to make SQLAlchemy better.
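As a rough sketch of that workaround, the helper below encodes a value using the column's code page instead of utf-8. It assumes the varchar column uses code page 1252 (the code page behind the SQL_Latin1_General_CP1_CI_AS collation mentioned in the question); the helper name and constant are made up for this example.

```python
# Assumption: the varchar column uses code page 1252, the code page of the
# SQL_Latin1_General_CP1_CI_AS collation from the question.
COLUMN_CODEPAGE = "cp1252"

def to_varchar_bytes(value: str) -> bytes:
    """Encode a str for comparison against a varchar column.

    Raises UnicodeEncodeError if the value cannot be represented in the
    column's code page, rather than silently corrupting it.
    """
    return value.encode(COLUMN_CODEPAGE)

key = "caf\u00e9"  # "café"
print(to_varchar_bytes(key))  # b'caf\xe9'
print(key.encode("utf-8"))    # b'caf\xc3\xa9' (different bytes)
```

For pure ASCII keys the two encodings produce identical bytes, which is why encoding as utf-8 can appear to work until the first accented character shows up.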

SQL Server default character encoding

By default - what is the character encoding set for a database in Microsoft SQL Server?
How can I see the current character encoding in SQL Server?

Encodings
In most cases, SQL Server stores Unicode data (i.e. that which is found in the XML and N-prefixed types) in UCS-2 / UTF-16 (storage is the same, UTF-16 merely handles Supplementary Characters correctly). This is not configurable: there is no option to use either UTF-8 or UTF-32 (see UPDATE section at the bottom re: UTF-8 starting in SQL Server 2019). Whether or not the built-in functions can properly handle Supplementary Characters, and whether or not those are sorted and compared properly, depends on the Collation being used. The older Collations — names starting with SQL_ (e.g. SQL_Latin1_General_CP1_CI_AS) xor no version number in the name (e.g. Latin1_General_CI_AS) — equate all Supplementary Characters with each other (due to having no sort weight). Starting in SQL Server 2005 they introduced the 90 series Collations (those with _90_ in the name) that could at least do a binary comparison on Supplementary Characters so that you could differentiate between them, even if they didn't sort in the desired order. That also holds true for the 100 series Collations introduced in SQL Server 2008. SQL Server 2012 introduced Collations with names ending in _SC that not only sort Supplementary Characters properly, but also allow the built-in functions to interpret them as expected (i.e. treating the surrogate pair as a single entity). Starting in SQL Server 2017, all new Collations (the 140 series) implicitly support Supplementary Characters, hence there are no new Collations with names ending in _SC.
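To make the Supplementary Character point concrete, here is a small Python sketch (not SQL Server code) that counts the 16-bit code units a string needs in UTF-16. A function that counts code units sees two of them for a Supplementary Character, where a reader sees a single character. The helper name is made up for this example.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u00e9"           # é, inside the Basic Multilingual Plane
supplementary = "\U0001F600"  # an emoji, outside the BMP

print(utf16_code_units(bmp_char))       # 1
print(utf16_code_units(supplementary))  # 2 (a surrogate pair)
print(len(supplementary))               # 1: Python counts code points
```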
Starting in SQL Server 2019, UTF-8 became a supported encoding for CHAR and VARCHAR data (columns, variables, and literals), but not TEXT (see UPDATE section at the bottom re: UTF-8 starting in SQL Server 2019).
Non-Unicode data (i.e. that which is found in the CHAR, VARCHAR, and TEXT types — but don't use TEXT, use VARCHAR(MAX) instead) uses an 8-bit encoding (Extended ASCII, DBCS, or EBCDIC). The specific character set / encoding is based on the Code Page, which in turn is based on the Collation of a column, or the Collation of the current database for literals and variables, or the Collation of the Instance for variable / cursor names and GOTO labels, or what is specified in a COLLATE clause if one is being used.
To see how locales match up to collations, check out:
Windows Collation Name
SQL Server Collation Name
To see the Code Page associated with a particular Collation (this is the character set and only affects CHAR / VARCHAR / TEXT data), run the following:
SELECT COLLATIONPROPERTY( 'Latin1_General_100_CI_AS' , 'CodePage' ) AS [CodePage];
To see the LCID (i.e. locale) associated with a particular Collation (this affects the sorting & comparison rules), run the following:
SELECT COLLATIONPROPERTY( 'Latin1_General_100_CI_AS' , 'LCID' ) AS [LCID];
To view the list of available Collations, along with their associated LCIDs and Code Pages, run:
SELECT [name],
COLLATIONPROPERTY( [name], 'LCID' ) AS [LCID],
COLLATIONPROPERTY( [name], 'CodePage' ) AS [CodePage]
FROM sys.fn_helpcollations()
ORDER BY [name];
Defaults
Before looking at the Server and Database default Collations, one should understand the relative importance of those defaults.
The Server (Instance, really) default Collation is used as the default for newly created Databases (including the system Databases: master, model, msdb, and tempdb). But this does not mean that any Database (other than the 4 system DBs) is using that Collation. The Database default Collation can be changed at any time (though there are dependencies that might prevent a Database from having its Collation changed). The Server default Collation, however, is not so easy to change. For details on changing all collations, please see: Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
The server/Instance Collation controls:
local variable names
CURSOR names
GOTO labels
Instance-level meta-data
The Database default Collation is used in three ways:
as the default for newly created string columns. But this does not mean that any string column is using that Collation. The Collation of a column can be changed at any time. Here knowing the Database default is important as an indication of what the string columns are most likely set to.
as the Collation for operations involving string literals, variables, and built-in functions that do not take string inputs but produce a string output (i.e. IF (@InputParam = 'something') ). Here knowing the Database default is definitely important as it governs how these operations will behave.
Database-level meta-data
The column Collation is either specified in the COLLATE clause at the time of the CREATE TABLE or an ALTER TABLE {table_name} ALTER COLUMN, or if not specified, taken from the Database default.
Since there are several layers here where a Collation can be specified (Database default / columns / literals & variables), the resulting Collation is determined by Collation Precedence.
All of that being said, the following query shows the default / current settings for the OS, SQL Server Instance, and specified Database:
SELECT os_language_version,
---
SERVERPROPERTY('LCID') AS 'Instance-LCID',
SERVERPROPERTY('Collation') AS 'Instance-Collation',
SERVERPROPERTY('ComparisonStyle') AS 'Instance-ComparisonStyle',
SERVERPROPERTY('SqlSortOrder') AS 'Instance-SqlSortOrder',
SERVERPROPERTY('SqlSortOrderName') AS 'Instance-SqlSortOrderName',
SERVERPROPERTY('SqlCharSet') AS 'Instance-SqlCharSet',
SERVERPROPERTY('SqlCharSetName') AS 'Instance-SqlCharSetName',
---
DATABASEPROPERTYEX(N'{database_name}', 'LCID') AS 'Database-LCID',
DATABASEPROPERTYEX(N'{database_name}', 'Collation') AS 'Database-Collation',
DATABASEPROPERTYEX(N'{database_name}', 'ComparisonStyle') AS 'Database-ComparisonStyle',
DATABASEPROPERTYEX(N'{database_name}', 'SQLSortOrder') AS 'Database-SQLSortOrder'
FROM sys.dm_os_windows_info;
Installation Default
Another interpretation of "default" could mean what default Collation is selected for the Instance-level collation when installing. That varies based on the OS language, but the (horrible, horrible) default for systems using "US English" is SQL_Latin1_General_CP1_CI_AS. In that case, the "default" encoding is Windows Code Page 1252 for VARCHAR data, and as always, UTF-16 for NVARCHAR data. You can find the list of OS language to default SQL Server collation here: Collation and Unicode support: Server-level collations. Keep in mind that these defaults can be overridden; this list is merely what the Instance will use if not overridden during install.
UPDATE 2018-10-02
SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes (not TEXT!). This is accomplished via a set of new collations, the names of which all end with _UTF8. This is an interesting capability that will definitely help some folks, but there are some "quirks" with it, especially when UTF-8 isn't being used for all columns and the Database's default Collation, so don't use it just because you have heard that UTF-8 is magically better. UTF-8 was designed solely for ASCII compatibility: to enable ASCII-only systems (i.e. UNIX back in the day) to support Unicode without changing any existing code or files. That it saves space for data using mostly (or only) US English characters (and some punctuation) is a side-effect. When not using mostly (or only) US English characters, data can be the same size as UTF-16, or even larger, depending on which characters are being used. And, in cases where space is being saved, performance might improve, but it might also get worse.
For a detailed analysis of this new feature, please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?".
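The size trade-off described above is easy to check outside SQL Server. This Python sketch compares encoded byte counts only; it says nothing about SQL Server's row overhead or compression, and the sample strings are arbitrary.

```python
def sizes(text: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(sizes("hello"))          # (5, 10): ASCII is half the size in UTF-8
print(sizes("\u03b1\u03c1"))   # (4, 4): these Greek letters are the same size
print(sizes("\u65e5\u672c"))   # (6, 4): these CJK characters are larger in UTF-8
```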

If you need to know the default collation for a newly created database use:
SELECT SERVERPROPERTY('Collation')
This is the server collation for the SQL Server instance that you are running.

The default character encoding for a SQL Server database is iso_1, which is ISO 8859-1. Note that the character encoding depends on the data type of a column. You can get an idea of what character encodings are used for the columns in a database as well as the collations using this SQL:
select data_type, character_set_catalog, character_set_schema, character_set_name, collation_catalog, collation_schema, collation_name, count(*) count
from information_schema.columns
group by data_type, character_set_catalog, character_set_schema, character_set_name, collation_catalog, collation_schema, collation_name;
If it's using the default, the character_set_name should be iso_1 for the char and varchar data types. Since nchar and nvarchar store Unicode data in UCS-2 format, the character_set_name for those data types is UNICODE.

SELECT DATABASEPROPERTYEX('DBName', 'Collation') SQLCollation;
Where DBName is your database name.

I think this is worthy of a separate answer: although Unicode data is stored internally as UTF-16 in SQL Server, it is the little-endian flavour, so if you're calling the database from an external system, you probably need to specify UTF-16LE.
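A short Python sketch of why the byte order matters when an external system reads or writes such data as raw bytes. The sample string is arbitrary; only the standard utf-16-le and utf-16-be codecs are used.

```python
text = "A\u03b1"  # "A" followed by Greek alpha

little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

print(little)  # b'A\x00\xb1\x03'
print(big)     # b'\x00A\x03\xb1'

# Decoding with the wrong byte order produces different characters.
print(little.decode("utf-16-le") == text)  # True
print(little.decode("utf-16-be") == text)  # False
```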

You can see the collation settings for each column of a table with the following query:
SELECT t.name TableName, c.name ColumnName, collation_name
FROM sys.columns c
INNER JOIN sys.tables t on c.object_id = t.object_id where t.name = 'name of table';

Sql Server 2008 - Difference between collation types

I'm installing a new SQL Server 2008 server and am having some problems finding any usable information regarding the different collations. I have searched SQL Server BOL and googled for an answer, but can't seem to find anything useful.
What is the difference between the Windows Collation "Finnish_Swedish_100" and "Finnish_Swedish"?
I suppose that the "_100" version is an updated collation in SQL Server 2008, but what has changed from the older version if that is the case?
Is it usually a good thing to have "Accent-sensitive" enabled? I know that it depends on the task and all that, but are there any well-known pros and cons to consider?
The "Binary" and "Binary-code point" parameters: in which cases should these be enabled?

The _100 indicates a collation sequence new in SQL Server 2008, those with _90 are for 2005 and those with no suffix are 2000. I don't know what the differences are, and can't find any documentation. Unless you are doing linked server queries to another SQL server of a different version, I'd be tempted to go with the _100 one. Sorry I can't help with the differences.

The letters ÅÄÖ/åäö do not mix up with A and O just by setting the collation to AI (Accent Insensitive). That is however true for â and other "combinations" not part of the Swedish alphabet as individual letters. â will mix or not mix depending of the setting in question.
Since I have a lot of old databases I still need to communicate with, also using linked servers, I chose Finnish_Swedish_CI_AS now that I'm installing SQL Server 2008. That was the default setting for Finnish_Swedish when the Windows collations first appeared in SQL Server.

Use the query below to try it out yourself.
As you can see, å, ä, etc. do not count as accented characters, and are sorted according to the Swedish alphabet when using the Finnish/Swedish collation.
However, the accents are only considered if you use the AS collation. For the AI collation, their order is unchanged, as if there was no accent at all.
CREATE TABLE #Test (
Number int identity,
Value nvarchar(20) NOT NULL
);
GO
INSERT INTO #Test VALUES ('àá');
INSERT INTO #Test VALUES ('áa');
INSERT INTO #Test VALUES ('aa');
INSERT INTO #Test VALUES ('aà');
INSERT INTO #Test VALUES ('áb');
INSERT INTO #Test VALUES ('ab');
-- w is considered an accented version of v
INSERT INTO #Test VALUES ('wa');
INSERT INTO #Test VALUES ('va');
INSERT INTO #Test VALUES ('zz');
INSERT INTO #Test VALUES ('åä');
GO
SELECT Number, Value FROM #Test ORDER BY Value COLLATE Finnish_Swedish_CI_AS;
SELECT Number, Value FROM #Test ORDER BY Value COLLATE Finnish_Swedish_CI_AI;
GO
DROP TABLE #Test;
GO

To address question 3 (info taken off the MSDN; wording theirs, format mine):
Binary (_BIN):
Sorts and compares data in SQL Server tables based on the bit patterns defined for each character.
Binary sort order is case-sensitive and accent-sensitive.
Binary is also the fastest sorting order.
If this option is not selected, SQL Server follows sorting and comparison rules as defined in dictionaries for the associated language or alphabet.
Binary-code point (_BIN2):
For Unicode data: Sorts and compares data in SQL Server tables based on Unicode code points.
For non-Unicode data: will use comparisons identical to binary sorts.
The advantage of using a Binary-code point sort order is that no data resorting is required in applications that compare sorted SQL Server data. As a result, a Binary-code point sort order provides simpler application development and possible performance increases.
For more information, see Guidelines for Using BIN and BIN2 Collations.
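For intuition about code point ordering, here is a Python sketch. Python compares strings by code point, which is similar in spirit to _BIN2 on Unicode data but is not a reproduction of SQL Server's implementation; the case-insensitive key is a crude stand-in for a dictionary collation, not a real one.

```python
words = ["b", "A", "a", "B", "\u00e4", "z"]  # \u00e4 is ä

# Python compares str values by code point, similar in spirit to _BIN2
# on Unicode data.
by_code_point = sorted(words)
print(by_code_point)  # ['A', 'B', 'a', 'b', 'z', 'ä']

# A crude case-insensitive ordering for contrast (not a real collation).
by_casefold = sorted(words, key=lambda w: (w.casefold(), w))
print(by_casefold)    # ['A', 'a', 'B', 'b', 'z', 'ä']
```

Code point order puts all uppercase ASCII letters before all lowercase ones, which is what makes a binary sort order case-sensitive.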

To address your question 1: accent sensitivity is a good thing to have enabled for Finnish-Swedish. Otherwise your "å"s and "ä"s will be sorted as "a"s and "ö"s as "o"s (assuming you will be using those kinds of international characters).
More here: http://msdn.microsoft.com/en-us/library/ms143515.aspx (discusses both binary codepoint and accent sensitivity)

On Questions 2 and 3
Accent Sensitivity is something I would suggest turning OFF if you are accepting user data, and ON if you have clean, sanitized data.
Not being Finnish myself, I don't know how many words there are that are different depending on the ó ô õ or ö that they have in them. But if there are users entering data, you can be sure that they will NOT be consistent in their usage, and you want to be able to match them.
If you are gathering data from a dataset that you know the content of, and know the consistency of, then you will want to turn Accent Sensitivity ON because you know that the differences are purposeful.
The same questions apply when considering Question 3. (I'm mostly getting this from the link Tomalak provided) If the data is case and accent sensitive, then you want _BIN, because it will sort faster. If the data is irregular, and not case/accent sensitive, then you will want _BIN2, because it is designed for Unicode data.
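To illustrate what matching inconsistent user input without regard to accents amounts to, here is a naive Python sketch. It is a generic Unicode folding, not the behaviour of any particular SQL Server collation, and the helper name is made up for this example.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Casefold and strip combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(fold_accents("R\u00e9sum\u00e9") == fold_accents("resume"))  # True

# Unlike Finnish_Swedish_CI_AI as described above, this naive folding
# also merges å with a.
print(fold_accents("\u00e5"))  # a
```

The second result shows why a language-aware collation is preferable to generic folding when letters such as å are separate letters of the alphabet.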

To address question 2:
Yes, if accents are grammatically required in the given language.
