Why does 'œ' match 'oe' in an NVarchar but not in a Varchar - sql-server

SELECT REPLACE(N'Chloe', 'œ', 'o'), REPLACE('Chloe', 'œ', 'o')
Results in:
Chlo Chloe
This is super weird.
Another way:
SELECT
CASE WHEN N'œ' = N'oe' THEN 1 ELSE 0 END as NVarcharMatch,
CASE WHEN 'œ' = 'oe' THEN 1 ELSE 0 END as VarcharMatch
Results in:
NVarCharMatch VarcharMatch
1 0

Both legacy SQL collations ("SQL_" prefix) and binary collations ("_BIN" or "_BIN2" suffix) compare only a single character at a time, so "œ" can never equal "oe".
Windows collations and Unicode comparison use more robust comparison rules. This allows the single "œ" character to compare as equal to the 2 consecutive characters "oe" because they are semantically identical.
--Chlo because Unicode comparison equal
SELECT REPLACE(N'Chloe' COLLATE SQL_Latin1_General_CP1_CI_AS, 'œ', 'o');
--Chloe because legacy SQL comparison unequal
SELECT REPLACE('Chloe' COLLATE SQL_Latin1_General_CP1_CI_AS, 'œ', 'o');
--Chloe because binary comparison unequal
SELECT REPLACE('Chloe' COLLATE Latin1_General_BIN, 'œ', 'o');
--Chlo because Windows collation comparison equal
SELECT REPLACE('Chloe' COLLATE Latin1_General_CI_AS, 'œ', 'o');
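The same distinction can be shown with the equality test from the question. This is a minimal sketch, assuming the database's default code page is 1252 so that the varchar literal 'œ' survives as a single character; the column aliases are made up for illustration.

```sql
-- Default collation of the current database; varchar literals without COLLATE use it.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;

-- Force each collation family onto the same comparison.
SELECT
    CASE WHEN 'œ'  = 'oe'  COLLATE SQL_Latin1_General_CP1_CI_AS THEN 1 ELSE 0 END AS VarcharSqlCollation,
    CASE WHEN N'œ' = N'oe' COLLATE SQL_Latin1_General_CP1_CI_AS THEN 1 ELSE 0 END AS NVarcharSqlCollation,
    CASE WHEN 'œ'  = 'oe'  COLLATE Latin1_General_BIN           THEN 1 ELSE 0 END AS VarcharBinary,
    CASE WHEN 'œ'  = 'oe'  COLLATE Latin1_General_CI_AS         THEN 1 ELSE 0 END AS VarcharWindowsCollation;
-- Expected, going by the REPLACE results above: 0, 1, 0, 1
```

If the first query reports a collation starting with SQL_, that would explain why the unqualified varchar comparison in the question returned 0.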

Related

Comparing romanian diacritics

I am working with romanian accented characters (diacritics) with Romanian_100_CI_AS collation.
Trying to select something regardless of accented characters gives me an unexpected result like this one:
IF N'tandarei' COLLATE Latin1_General_CI_AI = N'Țăndărei' COLLATE Latin1_General_CI_AI
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
returns
Values are different
"Țăndărei" is in a column with Romanian_100_CI_AS
what am I missing?
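One thing that may be worth checking, offered as an untested guess rather than a confirmed answer: an answer further down this page notes that the version 100 collations added many sort weights that are missing from the non-numbered series. Under that assumption, running the comparison with both generations of collation side by side would show whether the collation version is the culprit. The column aliases are illustrative.

```sql
-- Same comparison under the non-versioned and the version 100 collation.
-- No expected output is given here; the point is to see whether the two columns differ.
SELECT
    CASE WHEN N'tandarei' = N'Țăndărei' COLLATE Latin1_General_CI_AI
         THEN 1 ELSE 0 END AS NonVersionedMatch,
    CASE WHEN N'tandarei' = N'Țăndărei' COLLATE Latin1_General_100_CI_AI
         THEN 1 ELSE 0 END AS Version100Match;
```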

How to differentiate between 2 Arabic letters in SQL Server

In Arabic there are 2 letters that are pronounced the same but written differently:
The letter ة
and the letter ت
I wanted to replace the letter ة with another letter ه
Now I used this
Update MyTable
SET MyColumn = Replace ( MyColumn, N'ة' , N'ه' )
But this ended up replacing every occurrence of either ة or ت with ه.
How can I tell SQL Server to replace only ة Not ت ?
Specify a COLLATE clause with a binary collation to use the code points of the exact characters to be searched/replaced:
UPDATE dbo.MyTable
SET MyColumn = REPLACE( MyColumn COLLATE Arabic_BIN, N'ة' COLLATE Arabic_BIN, N'ه' COLLATE Arabic_BIN);
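Before running the UPDATE, it may help to try the same expression on a literal. The sketch below builds the test string from code points (U+0629 for ة, U+062A for ت, U+0647 for ه) so the result is easy to read back with UNICODE(). The variable names are made up, and Arabic_BIN is used only because the answer above uses it.

```sql
DECLARE @TehMarbuta nchar(1) = NCHAR(0x0629),
        @Teh        nchar(1) = NCHAR(0x062A),
        @Heh        nchar(1) = NCHAR(0x0647);

-- Two-character test string: teh marbuta followed by teh.
DECLARE @Before nvarchar(10) = @TehMarbuta + @Teh;
DECLARE @After  nvarchar(10) = REPLACE(@Before COLLATE Arabic_BIN, @TehMarbuta, @Heh);

SELECT UNICODE(SUBSTRING(@After, 1, 1)) AS FirstCodePoint,   -- 1607: teh marbuta was replaced by heh
       UNICODE(SUBSTRING(@After, 2, 1)) AS SecondCodePoint;  -- 1578: teh was left alone
```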

Special character (Hawaiian 'Okina) leads to weird string behavior

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. What's going on here? Am I missing something? Do other characters suffer from this same problem?
SELECT UNICODE(N'ʻ') -- Returns 699 as expected.
SELECT REPLACE(N'"ʻ', '"', '_') -- Returns "ʻ, I expected _ʻ
SELECT REPLACE(N'aʻ', 'a', '_') -- Returns aʻ, I expected _ʻ
SELECT REPLACE(N'"ʻ', N'ʻ', '_') -- Returns __, I expected "_
SELECT REPLACE(N'-', N'ʻ', '_') -- Returns -, I expected -
Also, strange when used in a LIKE for example:
DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
@table
VALUES
('John'),
('Jane')
SELECT
*
FROM
@table
WHERE
[Name] LIKE N'%ʻ%' -- This returns both records. I expected none.
The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. ... Do other characters suffer from this same problem?
A few things:
This is not a Hawaiian "quote": it's a "glottal stop" which affects pronunciation.
It is not "weird" behavior: it's just not what you were expecting.
This behavior is not specifically a "problem", though yes, there are other characters that exhibit similar behavior. For example, the following character (U+02DA Ring Above) behaves slightly differently depending on which side of a character it is on:
SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'˚a', N'_'); -- Returns a_a
SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'a˚', N'_'); -- Returns _aa
Now, anyone using SQL Server 2008 or newer should be using a 100 (or newer) level collation. They added a lot of sort weights and uppercase/lowercase mappings in the 100 series that aren't in the 90 series, or the non-numbered series, or the mostly obsolete SQL Server collations (those with names starting with SQL_).
The issue here is not that it doesn't equate to any other character (outside of a binary collation), and in fact it actually does equate to one other character (U+0312 Combining Turned Comma Above):
;WITH nums AS
(
SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
FROM [master].sys.all_columns ac1
CROSS JOIN [master].sys.all_columns ac2
)
SELECT nums.[num] AS [INTvalue],
CONVERT(BINARY(2), nums.[num]) AS [BINvalue],
NCHAR(nums.[num]) AS [Character]
FROM nums
WHERE NCHAR(nums.[num]) = NCHAR(0x02BB) COLLATE Latin1_General_100_CI_AS;
/*
INTvalue BINvalue Character
699 0x02BB ʻ
786 0x0312 ̒
*/
The issue is that this is a "spacing modifier" character, and so it attaches to, and modifies the meaning / pronunciation of, the character before or after it, depending on which modifier character you are dealing with.
According to the Unicode Standard, Chapter 7 (Europe-I), Section 7.8 (Modifier Letters), Page 323 (of the document, not of the PDF):
7.8 Modifier Letters
Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way. They are not formally combining marks (gc = Mn or gc = Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, indicating a change in pronunciation of a letter, or otherwise distinguishing a letter’s use. Typically this diacritic modification applies to the character preceding the modifier letter, but modifier letters may sometimes modify a following character. Occasionally a modifier letter may simply stand alone representing its own sound.
...
Spacing Modifier Letters: U+02B0–U+02FF
Phonetic Usage. The majority of the modifier letters in this block are phonetic modifiers, including the characters required for coverage of the International Phonetic Alphabet. In many cases, modifier letters are used to indicate that the pronunciation of an adjacent letter is different in some way—hence the name “modifier.” They are also used to mark stress or tone, or may simply represent their own sound.
The examples below should help illustrate. I am using a level 100 collation, and it needs to be accent-sensitive (i.e. name contains _AS):
SELECT REPLACE(N'ʻ' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _
SELECT REPLACE(N'ʻa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _a
SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns __aa
SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns ʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns aʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻa', N'_'); -- Returns _a
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'a', N'_'); -- Returns aʻ__
SELECT REPLACE(N'אʻaa' COLLATE Latin1_General_100_CI_AS, N'א', N'_'); -- Returns אʻaa
SELECT REPLACE(N'ffʻaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns ffʻaa
SELECT REPLACE(N'ffaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns _aa
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AS); -- 3
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AI); -- 1
SELECT 1 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AS; -- (0 rows returned)
SELECT 2 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AI; -- 2
If you need to deal with such characters in a way that ignores their intended linguistic behavior, then yes, you must use a binary collation. In such cases, please use the most recent level of collation, and BIN2 instead of BIN (assuming you are using SQL Server 2005 or newer). Meaning:
SQL Server 2000: Latin1_General_BIN
SQL Server 2005: Latin1_General_BIN2
SQL Server 2008, 2008 R2, 2012, 2014, and 2016: Latin1_General_100_BIN2
SQL Server 2017 and newer: Japanese_XJIS_140_BIN2
If you are curious why I make that recommendation, please see:
Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)
And, for more information on collations / Unicode / encodings / etc, please visit: Collations Info
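As a quick check of the binary-collation advice, the queries from the question can be rerun with an explicit BIN2 collation. This is only a sketch; Latin1_General_100_BIN2 is picked from the list above and assumes SQL Server 2008 or newer.

```sql
-- Under a binary collation only identical code points match,
-- so the modifier letter no longer attaches to its neighbor.
SELECT REPLACE(N'"ʻ' COLLATE Latin1_General_100_BIN2, N'"', N'_');  -- _ʻ
SELECT REPLACE(N'aʻ' COLLATE Latin1_General_100_BIN2, N'a', N'_');  -- _ʻ
SELECT REPLACE(N'"ʻ' COLLATE Latin1_General_100_BIN2, N'ʻ', N'_');  -- "_
SELECT 1 WHERE N'John' LIKE N'%ʻ%' COLLATE Latin1_General_100_BIN2; -- no rows
```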
I cannot provide a detailed answer, but I can provide a solution that meets your expectations.
This has to do with collations, though I'm not sure why the Windows collations give unexpected results. If you use a binary collation, you get the expected results (see Solomon's excellent answer for which BIN to use):
SELECT REPLACE(N'aʻ' COLLATE Latin1_General_BIN, N'a', N'_')
Returns _ʻ
DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
@table
VALUES
(N'John'),
(N'Jane'),
(N'Hawaiʻi'),
(N'Hawai''i'),
(NCHAR(699))
SELECT
*
FROM
@table
WHERE
[Name] like N'%ʻ%' COLLATE Latin1_General_BIN
Returns:
Hawaiʻi
ʻ
You can check which collations meet your expectations with the following code (adapted from code by @SolomonRutzky (source)). It evaluates whether REPLACE(N'"ʻ', N'ʻ', N'_') = '"_' holds for every collation:
DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1;';
SELECT @SQL += REPLACE(N'
IF((SELECT REPLACE(N''"ʻ'' COLLATE {Name}, N''ʻ'', N''_'')) = ''"_'')
BEGIN
RAISERROR(N''%4d. {Name}'', 10, 1, @Counter) WITH NOWAIT;
SET @Counter += 1;
END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM sys.fn_helpcollations() col
ORDER BY col.[name]
--PRINT @SQL;
EXEC (@SQL);

SQL Server not difference between 'ی' and 'ي' in Arabic_CI_AS collation

I'm using ASCII function for getting equivalent ASCII code of two characters, but I'm surprised when seeing there is no difference between 'ي' and 'ی', can anyone help me?
SELECT ASCII('ي'), ASCII('ی')
Because your characters are not ASCII characters, you have to use the UNICODE() function instead of ASCII().
SELECT ASCII('ي'), ASCII('ی')
will result: 237, 237
but
SELECT UNICODE(N'ي'), UNICODE(N'ی')
will result: 1610, 1740
Try this
SELECT UNICODE(N'ي'), UNICODE(N'ی')
Another solution, in case you want to keep using ASCII, is to apply a suitable collation:
Arabic_CS_AS_KS
The result will be ى = 236 and ي = 237.
This is a limitation of the ASCII function. According to the documentation, ASCII:
Returns the ASCII code value of the leftmost character of a character expression.
However, the characters in your question are made up of more than one byte. It appears that ASCII can only read one byte.
When you use these characters as string literals without the N prefix, they are treated as single-byte characters. The following query shows that SQL Server does not treat these characters as equal in the Arabic_CI_AS collation when they are properly marked as multi-byte:
SELECT CASE WHEN 'ي' COLLATE Arabic_CI_AS <> 'ی' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_ascii,
CASE WHEN N'ي' COLLATE Arabic_CI_AS <> N'ی' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_unicode
The following query shows the bytes that make up the characters:
SELECT CAST(N'ي' COLLATE Arabic_CI_AS as varbinary(4)),
CAST(N'ی' COLLATE Arabic_CI_AS as varbinary(4)),
CAST('ي' COLLATE Arabic_CI_AS as varbinary(4)),
CAST('ی' COLLATE Arabic_CI_AS as varbinary(4))
However, even when you mark the characters as unicode, the ASCII function returns the same value because it can only read one byte:
SELECT ASCII(N'ي' COLLATE Arabic_CI_AS) , ASCII(N'ی' COLLATE Arabic_CI_AS)
EDIT As TT. points out, these characters don't have an entry in the ASCII code table.
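A possible way to see what ASCII() is actually given is to make the conversion to varchar explicit. The sketch below assumes that ASCII() works on the varchar form of its argument and that the conversion goes through the collation's code page, which COLLATIONPROPERTY reports. The column aliases are made up for illustration.

```sql
-- Code page used for varchar data under this collation.
SELECT COLLATIONPROPERTY('Arabic_CI_AS', 'CodePage') AS CodePage;

-- Convert each character to varchar explicitly and look at the single byte that is stored.
-- If both columns show the same byte, the two characters have already collapsed
-- into one by the time ASCII() sees them.
SELECT CAST(CAST(N'ي' COLLATE Arabic_CI_AS AS varchar(1)) AS varbinary(1)) AS FirstCharByte,
       CAST(CAST(N'ی' COLLATE Arabic_CI_AS AS varchar(1)) AS varbinary(1)) AS SecondCharByte;
```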
The story becomes interesting when we have the following scripts:
SELECT ASCII('ك'), ASCII('ک');
SELECT
CASE
WHEN 'ك' COLLATE Arabic_CI_AS <> 'ک' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_ascii,
CASE WHEN N'ك' COLLATE Arabic_CI_AS <> N'ک' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_unicode;
The letters ک and ك seem to be an exception!
Isn't that so?

Select statement returns nothing when column collation SQL_Latin1_General_CP1_CI_AS in T-sql

I have a select statement as below:
SELECT Veri from tblTest
where CAST(Veri COLLATE SQL_Latin1_General_CP1_CI_AS as varchar(10))=
CAST('БHО' COLLATE SQL_Latin1_General_CP1_CI_AS as varchar(10))
Column Veri has collation of type SQL_Latin1_General_CP1_CI_AS.
There is a row with Veri equals БHО. However, select statement returns nothing.
Table tblTest's collation is also SQL_Latin1_General_CP1_CI_AS.
What am I doing wrong?
Edit: Column definition for column Veri is as follow:
CONDENSED_TYPE: nvarchar(50)
TABLE_SCHEMA: dbo
TABLE_NAME: tblTest
COLUMN_NAME: Veri
ORDINAL_POSITION: 2
COLUMN_DEFAULT: NULL
IS_NULLABLE: NO
DATA_TYPE: nvarchar
CHARACTER_MAXIMUM_LENGTH: 50
CHARACTER_OCTET_LENGTH: 100
NUMERIC_PRECISION: NULL
NUMERIC_PRECISION_RADIX: NULL
NUMERIC_SCALE: NULL
DATETIME_PRECISION: NULL
CHARACTER_SET_CATALOG: NULL
CHARACTER_SET_SCHEMA: NULL
COLLATION_NAME: SQL_Latin1_General_CP1_CI_AS
CHARACTER_SET_NAME: UNICODE
COLLATION_CATALOG: NULL
DOMAIN_SCHEMA: NULL
DOMAIN_NAME: NULL
In T-SQL the string constant 'БHО' is an ANSI string, and 'Б' is not available in its code page, so you'll get the question marks that @EduardUta asked about. You need to work with Unicode strings, using the N prefix for string constants and nvarchar. Try this:
SELECT Veri from tblTest
where CAST(Veri COLLATE SQL_Latin1_General_CP1_CI_AS as nvarchar(10)) =
CAST(N'БHО' COLLATE SQL_Latin1_General_CP1_CI_AS as nvarchar(10))
You may be able to remove the COLLATE directives - depends on your schema.
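To see the loss described above in isolation, here is a minimal sketch: convert a Cyrillic character to varchar under the column's collation and inspect what comes back. COLLATIONPROPERTY is used to show the code page behind the collation; the column aliases are illustrative.

```sql
-- Code page that varchar data uses under the column's collation.
SELECT COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS CodePage;  -- 1252

-- The Cyrillic letter has no slot in that code page, so the varchar copy is a question mark.
SELECT CAST(N'Б' COLLATE SQL_Latin1_General_CP1_CI_AS AS varchar(10))  AS AsVarchar,   -- ?
       CAST(N'Б' COLLATE SQL_Latin1_General_CP1_CI_AS AS nvarchar(10)) AS AsNVarchar;  -- Б
```

This is why casting both sides to varchar(10) in the original query can never match the stored row.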
Another thing you can do is to examine a string character by character to see what each character actually is. For example, in your string 'БHО' it might look like the Cyrillic capital letter Be followed by the English letters H and O, but it's not, that's why you are not getting a match.
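-- Walk the string one character at a time and print each character with its code point in hex.
declare @s nvarchar(100) = N'БНО'
declare @i int = 1 -- SUBSTRING positions are 1-based
while (@i <= len(@s))
begin
print substring(@s, @i, 1) + N' - 0x' + convert(varchar(8), convert(varbinary(4), unicode(substring(@s, @i, 1))), 2)
set @i = @i + 1
end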
Try typing the Н and О in the string N'БНО' above and running again - you'll see 0x48 and 0x4F respectively.
Hope this helps,
Rhys
