SQL Server sees no difference between 'ی' and 'ي' in Arabic_CI_AS collation - sql-server

I'm using the ASCII function to get the ASCII codes of two characters, but I was surprised to see that there is no difference between 'ي' and 'ی'. Can anyone help me?
SELECT ASCII('ي'), ASCII('ی')

Because these are not ASCII characters, you have to use the UNICODE() function (with N-prefixed literals) instead of ASCII().
SELECT ASCII('ي'), ASCII('ی')
will return: 237, 237
but
SELECT UNICODE(N'ي'), UNICODE(N'ی')
will return: 1610, 1740

Try this
SELECT UNICODE(N'ي'), UNICODE(N'ی')

Another solution is to use the proper collation in case you want to keep using ASCII:
Arabic_CS_AS_KS
The result will then be ى = 236 and ي = 237.
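A minimal sketch of that approach (assuming the conversion to varchar is done under the Arabic code page via this collation; the expected byte values are the ones quoted above):
-- Sketch: force the nvarchar literals through code page 1256 via Arabic_CS_AS_KS,
-- then read the single-byte values with ASCII(). Expected: 236 and 237, per the answer above.
SELECT ASCII(CONVERT(varchar(1), N'ى' COLLATE Arabic_CS_AS_KS)) AS alef_maksura,
       ASCII(CONVERT(varchar(1), N'ي' COLLATE Arabic_CS_AS_KS)) AS yeh;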

This is a limitation of the ASCII function. According to the documentation, ASCII:
Returns the ASCII code value of the leftmost character of a character expression.
However, the characters in your question take more than one byte to encode, and it appears that ASCII can only read one byte.
When you use these characters as string literals without the N prefix, they are treated as single-byte characters. The following query shows that SQL Server does not treat these characters as equal in the Arabic_CI_AS collation when they are properly marked as Unicode:
SELECT CASE WHEN 'ي' COLLATE Arabic_CI_AS <> 'ی' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_ascii,
CASE WHEN N'ي' COLLATE Arabic_CI_AS <> N'ی' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_unicode
The following query shows the bytes that make up the characters:
SELECT CAST(N'ي' COLLATE Arabic_CI_AS as varbinary(4)),
CAST(N'ی' COLLATE Arabic_CI_AS as varbinary(4)),
CAST('ي' COLLATE Arabic_CI_AS as varbinary(4)),
CAST('ی' COLLATE Arabic_CI_AS as varbinary(4))
However, even when you mark the characters as Unicode, the ASCII function returns the same value because it can only read one byte:
SELECT ASCII(N'ي' COLLATE Arabic_CI_AS) , ASCII(N'ی' COLLATE Arabic_CI_AS)
EDIT: As TT. points out, these characters don't have an entry in the ASCII code table.

The story becomes more interesting when we run the following scripts:
SELECT ASCII('ك'), ASCII('ک');
SELECT
CASE
WHEN 'ك' COLLATE Arabic_CI_AS <> 'ک' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_ascii,
CASE WHEN N'ك' COLLATE Arabic_CI_AS <> N'ک' COLLATE Arabic_CI_AS
THEN 1 ELSE 0 END AS are_different_unicode;
The letters ک and ك seem to be an exception!
Isn't that so?
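As a sanity check, the UNICODE() approach from the first answer can be applied to this pair as well (a quick sketch; U+0643 and U+06A9 are the standard code points of these two letters):
-- Sketch: the two kaf forms are distinct Unicode characters,
-- just like ي (U+064A, 1610) and ی (U+06CC, 1740) above.
SELECT UNICODE(N'ك') AS arabic_kaf,    -- 1603 (U+0643)
       UNICODE(N'ک') AS arabic_keheh;  -- 1705 (U+06A9)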

Related

Comparing Romanian diacritics

I am working with Romanian accented characters (diacritics) under the Romanian_100_CI_AS collation.
Trying to select something regardless of accented characters gives me an unexpected result like this one:
IF N'tandarei' COLLATE Latin1_General_CI_AI = N'Țăndărei' COLLATE Latin1_General_CI_AI
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
returns
Values are different
"Țăndărei" is in a column with Romanian_100_CI_AS
what am I missing?

How to differentiate between 2 Arabic letters in SQL Server

In Arabic there are two letters that are pronounced the same but written differently:
the letter ة
and the letter ت.
I wanted to replace the letter ة with another letter, ه.
So I used this:
Update MyTable
SET MyColumn = Replace ( MyColumn, N'ة' , N'ه' )
But it ended up replacing every occurrence of both ة and ت with ه.
How can I tell SQL Server to replace only ة, not ت?
Specify a COLLATE clause with a binary collation to use the code points of the exact characters to be searched/replaced:
UPDATE dbo.MyTable
SET MyColumn = REPLACE( MyColumn COLLATE Arabic_BIN, N'ة' COLLATE Arabic_BIN, N'ه' COLLATE Arabic_BIN);
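A small sketch of the difference this makes (the variable @s is hypothetical; the first column's behaviour depends on the column or database default collation, which the question reports conflates these letters, while the binary collation matches only the exact ة code point):
-- Sketch: REPLACE under the default collation vs. under Arabic_BIN.
DECLARE @s nvarchar(10) = N'ةت';
SELECT REPLACE(@s, N'ة', N'ه') AS default_collation,  -- result depends on the default collation
       REPLACE(@s COLLATE Arabic_BIN,
               N'ة' COLLATE Arabic_BIN,
               N'ه' COLLATE Arabic_BIN) AS binary_collation;  -- only the exact ة code point is replaced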

Why does 'œ' match 'oe' in an NVarchar but not in a Varchar

SELECT REPLACE(N'Chloe', 'œ', 'o'), REPLACE('Chloe', 'œ', 'o')
Results in:
Chlo Chloe
This is super weird.
Another way:
SELECT
CASE WHEN N'œ' = N'oe' THEN 1 ELSE 0 END as NVarcharMatch,
CASE WHEN 'œ' = 'oe' THEN 1 ELSE 0 END as VarcharMatch
Results in:
NVarCharMatch VarcharMatch
1 0
Both legacy SQL collations (the "SQL" name prefix) and binary collations (the "BIN" suffix) compare only single characters at a time, so "œ" can never equal "oe".
Windows collations and Unicode comparison use more robust comparison rules. This allows the single "œ" character to compare as equal to the 2 consecutive characters "oe" because they are semantically identical.
--Chlo because Unicode comparison equal
SELECT REPLACE(N'Chloe' COLLATE SQL_Latin1_General_CP1_CI_AS, 'œ', 'o');
--Chloe because legacy SQL comparison unequal
SELECT REPLACE('Chloe' COLLATE SQL_Latin1_General_CP1_CI_AS, 'œ', 'o');
--Chloe because binary comparison unequal
SELECT REPLACE('Chloe' COLLATE Latin1_General_BIN, 'œ', 'o');
--Chlo because Windows collation comparison equal
SELECT REPLACE('Chloe' COLLATE Latin1_General_CI_AS, 'œ', 'o');

Detect UNICODE characters that are not ASCII in table

I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search this whole table for any characters that are Unicode but not ASCII. Is this possible?
You can find non-ASCII characters quite simply:
SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))
If you want to determine if there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test.
The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?. But non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping.
For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ² and so it won't register as a row that can't convert to VARCHAR. For example:
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2
Ideally you would convert back to NVARCHAR explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR, so the behavior is the same either way.
Please note that only MAX types are used. Do not use NVARCHAR(4000) or VARCHAR(4000) else you might get false positives due to truncation of data in NVARCHAR(MAX) columns.
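A short sketch of that truncation pitfall (hypothetical data; the point is only that a value longer than 4000 characters compares unequal to its 4000-length cast even though every character converts cleanly):
-- Sketch: a pure-ASCII string of 5000 characters.
-- The NVARCHAR(4000) cast truncates it, so the comparison flags a difference
-- even though nothing is actually unconvertible; the MAX-based test does not.
DECLARE @long nvarchar(max) = REPLICATE(CONVERT(nvarchar(max), N'a'), 5000);
SELECT CASE WHEN @long <> CONVERT(nvarchar(4000), @long)
            THEN 'false positive' ELSE 'ok' END AS with_4000,
       CASE WHEN @long <> CONVERT(nvarchar(max),
                              CONVERT(varchar(max), @long COLLATE Latin1_General_100_BIN2))
            THEN 'flagged' ELSE 'ok' END AS with_max;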
So, in terms of the example code in the question, the query would be (assuming that a Latin1_General collation is being used):
SELECT usr.*
FROM dbo.[users] usr
WHERE usr.[name] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR usr.[address] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR usr.[description] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));
There doesn't seem to be an inbuilt function for this as far as I can tell. A brute force approach is to pass each character to ascii and then pass the result to char and check if it returns '?', which would mean the character is out of range. You can write a UDF with the below code as reference, but I should think that it is a very inefficient solution:
declare @i int = 1
declare @x nvarchar(10) = N'vsdǣf'
declare @result nvarchar(100) = N''
-- walk the string one character at a time
while (@i <= len(@x))
begin
    -- a character that round-trips through ascii() as '?' has no single-byte mapping
    if char(ascii(substring(@x, @i, 1))) = '?'
    begin
        set @result = @result + substring(@x, @i, 1)
    end
    set @i = @i + 1
end
select @result

Select statement returns nothing when the column collation is SQL_Latin1_General_CP1_CI_AS in T-SQL

I have a select statement as below:
SELECT Veri from tblTest
where CAST(Veri COLLATE SQL_Latin1_General_CP1_CI_AS as varchar(10))=
CAST('БHО' COLLATE SQL_Latin1_General_CP1_CI_AS as varchar(10))
Column Veri has collation of type SQL_Latin1_General_CP1_CI_AS.
There is a row where Veri equals БHО. However, the select statement returns nothing.
Table tblTest's collation is also SQL_Latin1_General_CP1_CI_AS.
What am I doing wrong?
Edit: The column definition for column Veri is as follows:
CONDENSED_TYPE: nvarchar(50)
TABLE_SCHEMA: dbo
TABLE_NAME: tblTest
COLUMN_NAME: Veri
ORDINAL_POSITION: 2
COLUMN_DEFAULT: NULL
IS_NULLABLE: NO
DATA_TYPE: nvarchar
CHARACTER_MAXIMUM_LENGTH: 50
CHARACTER_OCTET_LENGTH: 100
NUMERIC_PRECISION:NULL
NUMERIC_PRECISION_RADIX: NULL
NUMERIC_SCALE: NULL
DATETIME_PRECISION: NULL
CHARACTER_SET_CATALOG: NULL
CHARACTER_SET_SCHEMA: NULL
COLLATION_NAME: SQL_Latin1_General_CP1_CI_AS
CHARACTER_SET_NAME: UNICODE
COLLATION_CATALOG: NULL
DOMAIN_SCHEMA: NULL
DOMAIN_NAME: NULL
In T-SQL the string constant 'БHО' is an ANSI string, and 'Б' is not available in the code page, so you'll get the question marks that @EduardUta asked about. You need to work with Unicode strings, using the N prefix for string constants and nvarchar. Try this:
SELECT Veri from tblTest
where CAST(Veri COLLATE SQL_Latin1_General_CP1_CI_AS as nvarchar(10)) =
CAST(N'БHО' COLLATE SQL_Latin1_General_CP1_CI_AS as nvarchar(10))
You may be able to remove the COLLATE directives - depends on your schema.
Another thing you can do is examine a string character by character to see what each character actually is. For example, your string 'БНО' might look like the Cyrillic capital letter Be followed by the Latin letters H and O, but it isn't, and that is why you are not getting a match.
declare @s nvarchar(100) = N'БНО'
declare @i int = 1
while (@i <= len(@s))
begin
    -- print each character followed by its Unicode code point in hex
    print substring(@s, @i, 1) + N' - 0x' + convert(varchar(8), convert(varbinary(4), unicode(substring(@s, @i, 1))), 2)
    set @i = @i + 1
end
Try retyping the Н and О in the string N'БНО' above as Latin letters and running it again - you'll see 0x48 and 0x4F respectively.
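As a quick illustration of the lookalike issue (a sketch; these are the standard code points of the Latin letters and their Cyrillic lookalikes):
-- Sketch: Latin H and O versus the Cyrillic lookalikes Н and О.
SELECT UNICODE(N'H') AS latin_H,      -- 72   (0x48)
       UNICODE(N'Н') AS cyrillic_En,  -- 1053 (0x041D)
       UNICODE(N'O') AS latin_O,      -- 79   (0x4F)
       UNICODE(N'О') AS cyrillic_O;   -- 1054 (0x041E)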
Hope this helps,
Rhys
