Special character (Hawaiian 'Okina) leads to weird string behavior - sql-server

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. What's going on here? Am I missing something? Do other characters suffer from this same problem?
SELECT UNICODE(N'ʻ') -- Returns 699 as expected.
SELECT REPLACE(N'"ʻ', '"', '_') -- Returns "ʻ, I expected _ʻ
SELECT REPLACE(N'aʻ', 'a', '_') -- Returns aʻ, I expected _ʻ
SELECT REPLACE(N'"ʻ', N'ʻ', '_') -- Returns __, I expected "_
SELECT REPLACE(N'-', N'ʻ', '_') -- Returns _, I expected -
Also, strange when used in a LIKE for example:
DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
    @table
VALUES
    ('John'),
    ('Jane')
SELECT
    *
FROM
    @table
WHERE
    [Name] LIKE N'%ʻ%' -- This returns both records. I expected none.

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. ... Do other characters suffer from this same problem?
A few things:
This is not a Hawaiian "quote": it's a "glottal stop" which affects pronunciation.
It is not "weird" behavior: it's just not what you were expecting.
This behavior is not specifically a "problem", though yes, there are other characters that exhibit similar behavior. For example, the following character (U+02DA Ring Above) behaves slightly differently depending on which side of a character it is on:
SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'˚a', N'_'); -- Returns a_a
SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'a˚', N'_'); -- Returns _aa
Now, anyone using SQL Server 2008 or newer should be using a 100 (or newer) level collation. They added a lot of sort weights and uppercase/lowercase mappings in the 100 series that aren't in the 90 series, or the non-numbered series, or the mostly obsolete SQL Server collations (those with names starting with SQL_).
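If you are curious whether a particular collation even assigns this character a sort weight, a quick check (a sketch, assuming both collations exist on your instance, which they do on any modern install) is to compare it to the empty string, since characters with no defined sort weight compare as equal to N'':
SELECT CASE WHEN N'ʻ' = N'' COLLATE Latin1_General_CI_AS     THEN 'no sort weight' ELSE 'has sort weight' END AS [PreLevel100],
       CASE WHEN N'ʻ' = N'' COLLATE Latin1_General_100_CI_AS THEN 'no sort weight' ELSE 'has sort weight' END AS [Level100];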
The issue here is not that the character fails to equate to any other character (outside of a binary collation); in fact, it does equate to exactly one other character (U+0312 Combining Turned Comma Above):
;WITH nums AS
(
    SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
    FROM [master].sys.all_columns ac1
    CROSS JOIN [master].sys.all_columns ac2
)
SELECT nums.[num] AS [INTvalue],
       CONVERT(BINARY(2), nums.[num]) AS [BINvalue],
       NCHAR(nums.[num]) AS [Character]
FROM nums
WHERE NCHAR(nums.[num]) = NCHAR(0x02BB) COLLATE Latin1_General_100_CI_AS;
/*
INTvalue BINvalue Character
699 0x02BB ʻ
786 0x0312 ̒
*/
The issue is that this is a "spacing modifier" character, and so it attaches to, and modifies the meaning / pronunciation of, the character before or after it, depending on which modifier character you are dealing with.
According to the Unicode Standard, Chapter 7 (Europe-I), Section 7.8 (Modifier Letters), Page 323 (of the document, not of the PDF):
7.8 Modifier Letters
Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way. They are not formally combining marks (gc = Mn or gc = Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, indicating a change in pronunciation of a letter, or otherwise distinguishing a letter’s use. Typically this diacritic modification applies to the character preceding the modifier letter, but modifier letters may sometimes modify a following character. Occasionally a modifier letter may simply stand alone representing its own sound.
...
Spacing Modifier Letters: U+02B0–U+02FF
Phonetic Usage. The majority of the modifier letters in this block are phonetic modifiers, including the characters required for coverage of the International Phonetic Alphabet. In many cases, modifier letters are used to indicate that the pronunciation of an adjacent letter is different in some way—hence the name “modifier.” They are also used to mark stress or tone, or may simply represent their own sound.
The examples below should help illustrate. I am using a level 100 collation, and it needs to be accent-sensitive (i.e. name contains _AS):
SELECT REPLACE(N'ʻ' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _
SELECT REPLACE(N'ʻa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _a
SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns __aa
SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns ʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns aʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻa', N'_'); -- Returns _a
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'a', N'_'); -- Returns aʻ__
SELECT REPLACE(N'אʻaa' COLLATE Latin1_General_100_CI_AS, N'א', N'_'); -- Returns אʻaa
SELECT REPLACE(N'ffʻaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns ffʻaa
SELECT REPLACE(N'ffaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns _aa
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AS); -- 3
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AI); -- 1
SELECT 1 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AS; -- (0 rows returned)
SELECT 2 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AI; -- 2
If you need to deal with such characters in a way that ignores their intended linguistic behavior, then yes, you must use a binary collation. In such cases, please use the most recent level of collation, and BIN2 instead of BIN (assuming you are using SQL Server 2005 or newer). Meaning:
SQL Server 2000: Latin1_General_BIN
SQL Server 2005: Latin1_General_BIN2
SQL Server 2008, 2008 R2, 2012, 2014, and 2016: Latin1_General_100_BIN2
SQL Server 2017 and newer: Japanese_XJIS_140_BIN2
If you are curious why I make that recommendation, please see:
Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)
And, for more information on collations / Unicode / encodings / etc, please visit: Collations Info
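To tie this back to the original examples: under one of the binary collations recommended above, each code point matches only itself, so the REPLACE calls from the question behave the way the poster expected (a sketch, assuming Latin1_General_100_BIN2 is available, i.e. SQL Server 2008 or newer):
SELECT REPLACE(N'"ʻ' COLLATE Latin1_General_100_BIN2, N'"', N'_'); -- Returns _ʻ
SELECT REPLACE(N'aʻ' COLLATE Latin1_General_100_BIN2, N'a', N'_'); -- Returns _ʻ
SELECT REPLACE(N'"ʻ' COLLATE Latin1_General_100_BIN2, N'ʻ', N'_'); -- Returns "_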

I cannot provide a detailed answer, but I can provide a solution that fulfills your expectations.
This has to do with collations, though I'm not sure why the Windows collations give unexpected results. If you use a binary collation, you get the expected results (see Solomon's excellent answer for which BIN collation to use):
SELECT REPLACE(N'aʻ' COLLATE Latin1_General_BIN, N'a', N'_')
Returns _ʻ
DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
    @table
VALUES
    (N'John'),
    (N'Jane'),
    (N'Hawaiʻi'),
    (N'Hawai''i'),
    (NCHAR(699))
SELECT
    *
FROM
    @table
WHERE
    [Name] LIKE N'%ʻ%' COLLATE Latin1_General_BIN
Returns:
Hawaiʻi
ʻ
You can check which collation confirms your expectations with the following code (adapted from code by @SolomonRutzky (source)). It evaluates SELECT REPLACE(N'"ʻ', N'ʻ', N'_') = N'"_' for all collations:
DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1;';
SELECT @SQL += REPLACE(N'
IF((SELECT REPLACE(N''"ʻ'' COLLATE {Name}, N''ʻ'', N''_'')) = ''"_'')
BEGIN
    RAISERROR(N''%4d. {Name}'', 10, 1, @Counter) WITH NOWAIT;
    SET @Counter += 1;
END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM sys.fn_helpcollations() col
ORDER BY col.[name];
--PRINT @SQL;
EXEC (@SQL);

Related

"Create sql function , select english characters?"

I am looking for a function that selects English numbers and letters only:
Example:
TEKA תנור ביל דין in HLB-840 P-WH לבן
I want to run a function and get the following result:
TEKA HLB-840 P-WH
I'm using MS SQL Server 2012
What you really need here is regex replacement, which SQL Server does not support natively. Broadly speaking, you would want to find [^A-Za-z0-9 -]+\s* and replace it with an empty string; for the input you provided, that would output TEKA in HLB-840 P-WH. You might be able to do this in SQL Server using a regex package or UDF. Or, you could do the replacement outside of SQL using any number of tools which support regex (e.g. C#).
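If you have to stay inside T-SQL, a loop over PATINDEX can approximate that character class. This is only a rough sketch under assumptions: dbo.KeepEnglishOnly is a made-up name, the trailing - in the bracket class is intended as a literal hyphen, and unlike the regex it does not collapse the leftover spaces:
CREATE FUNCTION dbo.KeepEnglishOnly (@s NVARCHAR(4000)) -- hypothetical helper, not a built-in
RETURNS NVARCHAR(4000)
AS
BEGIN
    -- Strip every character outside A-Z, a-z, 0-9, space and hyphen,
    -- roughly mimicking the regex [^A-Za-z0-9 -]+ one character at a time.
    WHILE PATINDEX(N'%[^A-Za-z0-9 -]%', @s COLLATE Latin1_General_100_BIN2) > 0
        SET @s = STUFF(@s, PATINDEX(N'%[^A-Za-z0-9 -]%', @s COLLATE Latin1_General_100_BIN2), 1, N'');
    RETURN @s;
END;
-- Usage: SELECT dbo.KeepEnglishOnly(N'TEKA תנור ביל דין in HLB-840 P-WH לבן');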
SQL Server is not the right tool for this.
The following might work for you, but there is no guarantee:
DECLARE @yourString NVARCHAR(MAX) = N'TEKA תנור ביל דין in HLB-840 P-WH לבן';
SELECT REPLACE(REPLACE(REPLACE(REPLACE(CAST(@yourString AS VARCHAR(MAX)),'?',''),' ','|~'),'~|',''),'|~',' ');
The idea in short:
Casting NVARCHAR to VARCHAR returns every character in your string that is not known in the given collation's code page as a question mark. The rest is just replacing those question marks and the resulting multi-blanks.
If your string can itself include a question mark, replace it first with an unused character and swap it back at the end (see the sketch below).
If your string might include either | or ~, you should use other characters for the multi-blank replacement.
You can tune this approach by specifying an explicit collation if some characters slip through...
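As a small sketch of the question-mark safeguard mentioned above (the placeholder NCHAR(164), the ¤ character, is an arbitrary choice; it assumes that character does not otherwise occur in your data and that the database collation uses a Latin code page so the placeholder survives the cast):
DECLARE @yourString NVARCHAR(MAX) = N'TEKA תנור? in HLB-840?';
-- 1) park the real question marks in the placeholder, 2) cast (foreign characters become '?'),
-- 3) drop the generated '?', 4) restore the original question marks.
SELECT REPLACE(
           REPLACE(CAST(REPLACE(@yourString, N'?', NCHAR(164)) AS VARCHAR(MAX)), '?', ''),
           CHAR(164), '?');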
There is no built-in function for this purpose, but you can create your own; it should look something like this:
--create function (split string, and concatenate the required characters)
CREATE FUNCTION dbo.CleanStringZZZ ( @string VARCHAR(100))
RETURNS VARCHAR(100)
AS
BEGIN
    DECLARE @B VARCHAR(100) = '';
    WITH t --recursive part to create the sequence 1,2,3... but it would be better to use an existing numbers table with an index
    AS
    (
        SELECT n = 1
        UNION ALL
        SELECT n = n + 1
        FROM t
        WHERE n <= LEN(@string)
    )
    SELECT @B = @B + SUBSTRING(@string, t.n, 1)
    FROM t
    WHERE SUBSTRING(@string, t.n, 1) != '?' --this is just an example...
    --WHERE ASCII(SUBSTRING(@string, t.n, 1)) BETWEEN 32 AND 127 --you can use something like this
    ORDER BY t.n;
    RETURN @B;
END;
and then you can use this function in your select statement:
SELECT dbo.CleanStringZZZ('TEKA תנור ביל דין in HLB-840 P-WH לבן');
create function dbo.AlphaNumericOnly(@string varchar(max))
returns varchar(max)
as
begin
    While PatIndex('%[^a-z0-9]%', @string) > 0
        Set @string = Stuff(@string, PatIndex('%[^a-z0-9]%', @string), 1, '')
    return @string
end
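A usage sketch for the function above; note that, because it keeps only characters matching [a-z0-9], it also strips spaces and hyphens under typical collations, so it does not quite produce the output asked for in the question:
SELECT dbo.AlphaNumericOnly(N'TEKA תנור ביל דין in HLB-840 P-WH לבן');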

SQL Server 2016 How to use a simple Regular Expression in T-SQL?

I have a column with the name of a person in the following format: "LAST NAME, FIRST NAME"
Only Upper Cases Allowed
Space after comma optional
I would like to use a regular expression like: [A-Z]+,[ ]?[A-Z]+ but I do not know how to do this in T-SQL. In Oracle, I would use REGEXP_LIKE, is there something similar for SQL Server 2016?
I need something like the following:
UPDATE table
SET is_correct_format = 'YES'
WHERE REGEXP_LIKE(table.name,'[A-Z]+,[ ]?[A-Z]+');
First, case sensitivity depends on the collation of the database, though with LIKE you can force a case-sensitive comparison via COLLATE. With that... here is some Boolean logic to take care of the cases you stated. Though, you may need to add additional clauses if you discover some bogus input.
declare @table table (Person varchar(64), is_correct_format varchar(3) default 'NO')
insert into @table (Person)
values
('LowerCase, Here'),
('CORRECTLY, FORMATTED'),
('CORRECTLY,FORMATTEDTWO'),
('ONLY FIRST UPPER, LowerLast'),
('WEGOT, FormaNUMB3RStted'),
('NoComma Formatted'),
('CORRECTLY, TWOCOMMA, A'),
(',COMMA FIRST'),
('COMMA LAST,'),
('SPACE BEFORE COMMA , GOOD'),
(' SPACE AT BEGINNING, GOOD')
update @table
set is_correct_format = 'YES'
where
    Person not like '%[^A-Z, ]%' --check for characters other than letters, commas and spaces
    and len(replace(Person, ' ', '')) = len(replace(replace(Person, ' ', ''), ',', '')) + 1 --make sure there is only one comma
    and charindex(',', Person) <> 1 --make sure the comma isn't at the beginning
    and charindex(',', Person) <> len(Person) --make sure the comma isn't at the end
    and substring(Person, charindex(',', Person) - 1, 1) <> ' ' --make sure there isn't a space before the comma
    and left(Person, 1) <> ' ' --check for preceding spaces
    and UPPER(Person) = Person collate Latin1_General_CS_AS --only upper case allowed (needs a case-sensitive collation, since the default is CI)
select * from @table
The T-SQL equivalent could look like this. I'm not vouching for the efficiency of this solution.
declare @table as table(name varchar(20), is_Correct_format varchar(5))
insert into @table(name) Values
('Smith, Jon')
,('se7en, six')
,('Billy bob')
UPDATE @table
SET is_correct_format = 'YES'
WHERE
    replace(name, ', ', ',x')
    like (replicate('[a-z]', charindex(',', name) - 1)
          + ','
          + replicate('[a-z]', len(name) - charindex(',', name)))
select * from @table
The optional space is hard to handle, so since it sits next to a legal character I simply replace ', ' with another legal character when it is there.
T-SQL does not provide the kind of 'repeating pattern' that * or + gives you in regex, so you have to count the characters and repeat the character class that many times in your search pattern.
I split the string at the comma, counted the characters before and after it, and built a search pattern to match.
Clunky, but doable.
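To make the counting concrete, here is a small sketch (not part of the original answer) showing the intermediate pieces for one of the sample rows:
DECLARE @name VARCHAR(20) = 'Smith, Jon';
SELECT REPLACE(@name, ', ', ',x') AS normalized, -- 'Smith,xJon'
       REPLICATE('[a-z]', CHARINDEX(',', @name) - 1)
       + ','
       + REPLICATE('[a-z]', LEN(@name) - CHARINDEX(',', @name)) AS built_pattern;
-- built_pattern comes out as '[a-z][a-z][a-z][a-z][a-z],[a-z][a-z][a-z][a-z]', which the normalized value matches.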

Detect UNICODE characters that are not ASCII in table

I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search all this table for any characters that are UNICODE but not ASCII. Is this possible?
You can find non-ASCII characters quite simply:
SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))
If you want to determine if there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test.
The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?. But non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping.
For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ² and so it won't register as a row that can't convert to VARCHAR. For example:
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2
Ideally you would convert back to NVARCHAR explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR, so the behavior is the same either way.
Please note that only MAX types are used. Do not use NVARCHAR(4000) or VARCHAR(4000) else you might get false positives due to truncation of data in NVARCHAR(MAX) columns.
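A quick sketch of the truncation false positive this warns about: the value below contains nothing but plain ASCII, yet it fails the test purely because VARCHAR(4000) cuts it short.
DECLARE @v NVARCHAR(MAX) = REPLICATE(CONVERT(NVARCHAR(MAX), N'a'), 5000);
SELECT CASE WHEN @v <> CONVERT(NVARCHAR(MAX), CONVERT(VARCHAR(4000), @v))
            THEN 'flagged (false positive)'
            ELSE 'not flagged' END AS result; -- returns 'flagged (false positive)'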
So, in terms of the example code in the question, the query would be (assuming that a Latin1_General collation is being used):
SELECT usr.*
FROM dbo.[users] usr
WHERE usr.[name] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR usr.[address] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR usr.[description] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));
There doesn't seem to be an inbuilt function for this as far as I can tell. A brute force approach is to pass each character to ascii and then pass the result to char and check if it returns '?', which would mean the character is out of range. You can write a UDF with the below code as reference, but I should think that it is a very inefficient solution:
declare @i int = 1
declare @x nvarchar(10) = N'vsdǣf'
declare @result nvarchar(100) = N''
while (@i <= len(@x)) -- <= so the last character is checked as well
begin
    if char(ascii(substring(@x, @i, 1))) = '?'
    begin
        set @result = @result + substring(@x, @i, 1)
    end
    set @i = @i + 1
end
select @result

How can I search for a sequence of bytes in SQL Server varbinary(max) field?

I am trying to write a query on SQL Server 2012 that will return varbinary(max) columns that contain a specified byte sequence. I am able to do that with a query that converts the varbinary field to varchar and uses LIKE:
SELECT * FROM foo
WHERE CONVERT(varchar(max), myvarbincolumn) LIKE
'%' + CONVERT(varchar(max), 0x626C6168) + '%'
where "0x626C6168" is my target byte sequence. Unfortunately, this works only if the field does not contain any bytes with the value zero (0x00) and those are very common in my data. Is there a different approach I can take that will work with values that contain zero-valued bytes?
If you use a binary collation it should work.
WITH foo(myvarbincolumn) AS
(
SELECT 0x00626C616800
)
SELECT *
FROM foo
WHERE CONVERT(VARCHAR(max), myvarbincolumn) COLLATE Latin1_General_100_BIN2
LIKE '%' + CONVERT(VARCHAR(max), 0x626C6168) + '%'
You might need (say) Latin1_General_BIN if on an older version of SQL Server.
Unfortunately the solution proposed by Martin has a flaw.
If the binary sequence in the search key contains a 0x25 byte, that byte is translated to the % character (per the ASCII table).
That character is then interpreted as a wildcard in the LIKE clause, causing many unwanted results to show up.
-- A table with a binary column:
DECLARE @foo TABLE(BinCol VARBINARY(MAX));
INSERT INTO @foo (BinCol) VALUES (0x001125), (0x000011), (0x001100), (0x110000);
-- The search key:
DECLARE @key VARBINARY(MAX) = 0x1125; -- 0x25 is '%' in the ASCII table!
-- This returns ALL values from the table, because of the wildcard in the search key:
SELECT * FROM @foo WHERE
    CONVERT(VARCHAR(max), BinCol) COLLATE Latin1_General_100_BIN2
    LIKE ('%' + CONVERT(VARCHAR(max), @key) + '%');
To fix this issue, use the search clause below:
-- This returns just the correct value -> 0x001125
SELECT * FROM @foo WHERE
    CHARINDEX
    (
        CONVERT(VARCHAR(max), @key),
        CONVERT(VARCHAR(max), BinCol) COLLATE Latin1_General_100_BIN2
    ) > 0;
I just discovered this very simple query.
SELECT * FROM foo
WHERE CONVERT(varchar(max), myvarbincolumn,2) LIKE '%626C6168%'
The characters 0x aren't added to the left of the converted result for style 2.
https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver16#binary-styles
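A minimal sketch of what the style-2 conversion produces for the sample byte sequence; the point is that no leading 0x appears in the output, so the hex digits of the haystack and the needle can be compared directly:
SELECT CONVERT(varchar(max), 0x00626C616800, 2) AS haystack_hex,
       CONVERT(varchar(max), 0x626C6168, 2)     AS needle_hex;
-- The first value contains the second as a substring, which is exactly what the LIKE test above relies on.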

How VARCHAR/CHAR manages to store/render multinational symbols in SQL Server?

I used to read that varchar (char) stores ASCII characters at 1 byte per character, while nvarchar (nchar) uses Unicode at 2 bytes.
But which ASCII? In SSMS 2008 R2:
DECLARE @temp VARCHAR(3); --CHAR(3)
SET @temp = 'ЮЯç'; --cyrillic + portuguese-specific letters
select @temp, datalength(@temp)
-- results in
-- ЮЯç 3
Update: Oops, the result was really ЮЯс, not ЮЯç. Thanks, Martin.
declare @table table
(
    c1 char(4) collate Cyrillic_General_CS_AI,
    c2 char(4) collate Latin1_General_100_CS_AS_WS
)
INSERT INTO @table VALUES (N'ЮЯçæ', N'ЮЯçæ')
SELECT c1, cast(c1 as binary(4)) as c1bin, c2, cast(c2 as binary(4)) as c2bin
FROM @table
Returns
c1 c1bin c2 c2bin
---- ---------- ---- ----------
ЮЯc? 0xDEDF633F ??çæ 0x3F3FE7E6
You can see that, depending on the collation, non-ASCII characters can get lost or be silently converted to near equivalents.
It's ASCII plus a code page that defines the upper 128 characters (128-255). This is controlled by the collation in SQL Server, and depending on the collation you use, you get a different subset of "special" characters.
See this MSDN page.
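As a small illustration of the code-page dependence (a sketch; the two collations shown ship with SQL Server): the single byte 0xDE is rendered as a different character depending on which code page the collation implies.
SELECT CAST(CAST(0xDE AS CHAR(1)) COLLATE Cyrillic_General_CS_AI AS NCHAR(1)) AS cp1251_view, -- expected: Ю
       CAST(CAST(0xDE AS CHAR(1)) COLLATE Latin1_General_CI_AS   AS NCHAR(1)) AS cp1252_view; -- expected: Þ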
