New database: should it be accent sensitive?


I am preparing a new database server, to which I will migrate data from a large, existing, multilingual database (mostly English/French/Spanish text, with occasional special characters from other languages, e.g. in city names).
My question is: should it be accent sensitive?
Users would be happy if search made no distinction between cafe and café.
But as a DBA, I am worried: I have never seen a database that did not suffer from bad character conversions at least once in a while. If I choose accent insensitive, how will I query the database and ask "give me all books where the title contains a special character"?
If there is a way to do this, I would happily go for accent insensitive.

It should depend on your general usage.
This doesn't preclude you from changing it for a specific query, e.g.:
declare @t1 table (word nvarchar(50) collate Latin1_General_CI_AI)
declare @t2 table (word nvarchar(50) collate Latin1_General_CI_AS)
insert @t1 values ('cafe'),('restaurant'), ('café')
insert @t2 select * from @t1
-- @t1 is accent insensitive unless the query overrides the collation
select * from @t1 where word like 'cafe'
select * from @t1 where word like 'cafe' collate Latin1_General_CI_AS
select * from @t1 where word like 'café'
select * from @t1 where word like 'café' collate Latin1_General_CI_AS
-- @t2 is accent sensitive unless the query overrides the collation
select * from @t2 where word like 'cafe'
select * from @t2 where word like 'cafe' collate Latin1_General_CI_AI
select * from @t2 where word like 'café'
select * from @t2 where word like 'café' collate Latin1_General_CI_AI

You can change collation at select time:
with t as (
select 'ali' as w union
select 'àli' as w
)
select *
into #t
from t;
select * from #t
where w collate Latin1_General_CS_AS_KS_WS like '%à%'
w
---
àli
select * from #t
where w collate Traditional_Spanish_ci_ai like '%à%'
w
---
ali
àli
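For the question's "give me all books where the title contains a special character", one possible approach is to switch to a binary collation for that single predicate and look for anything outside printable ASCII. This is only a sketch: the @books table and its title column are hypothetical names, and "special" is assumed here to mean "not in the range space through tilde".

```sql
-- Hypothetical table and column names, for illustration only.
DECLARE @books TABLE (title NVARCHAR(100) COLLATE Latin1_General_CI_AI);
INSERT INTO @books (title) VALUES (N'cafe'), (N'café'), (N'restaurant');

-- A binary collation compares code points, so the range [ -~]
-- covers the printable ASCII characters (space through tilde).
-- The negated class [^ -~] matches any character outside that range.
SELECT title
FROM @books
WHERE title COLLATE Latin1_General_BIN2 LIKE N'%[^ -~]%';
```

On this sample the query should return only café, even though the column itself is accent insensitive. The assumed definition of "special" can be narrowed or widened by editing the character class.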

Related

Why SQL Server has error when i use LIKE operator?

I want to select employees whose first name has 'l' as its third character.
After executing the query, there are 4 correct records and 1 incorrect one.
I don't understand why the record with the first name 'Philip', where 'l' is the fourth character, is selected.
My SQL Statements

Yes, this is a collation-specific problem, and those who downvoted indiscriminately should think a little more before voting on instinct!
With a Vietnamese_100_... collation (which I think is the case for our user HaNgocHieu), or with some other collations such as Welsh_100_..., the two letters "Ph" are treated as a single character, so the query also returns Philip.
As a test:
CREATE TABLE #employee
( fname VARCHAR(20) NOT NULL,
lname VARCHAR(20) NOT NULL);
INSERT INTO #employee(fname,lname)
VALUES ('Philip','Cramer');
SELECT *
FROM #employee e
where fname + lname COLLATE Vietnamese_CI_AI LIKE '__l%';
fname lname
-------------------- --------------------
Philip Cramer
SELECT *
FROM #employee e
where fname + lname COLLATE French_CI_AI LIKE '__l%';
fname lname
-------------------- --------------------
SELECT *
FROM #employee e
where fname + lname COLLATE Welsh_100_CI_AI LIKE '__l%';
fname lname
-------------------- --------------------
Philip Cramer
So SQL Server is not in error, and HaNgocHieu did not make a mistake either. Using a language-specific collation with data from other languages can cause trouble, which can be solved by using an international collation such as the Latin ones.
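The per-query override can be sketched as follows. This is an illustration under assumptions, not the asker's actual query (which is not shown): the table variable is hypothetical, the names are sample data from this thread, and Latin1_General_100_CI_AI is just one possible "international" collation.

```sql
-- Hypothetical stand-in for the asker's employee table.
DECLARE @employee TABLE (fname VARCHAR(20) NOT NULL, lname VARCHAR(20) NOT NULL);
INSERT INTO @employee (fname, lname) VALUES
    ('Helen', 'Bennett'),
    ('Helvetius', 'Nagy'),
    ('Palle', 'Ibsen'),
    ('Philip', 'Cramer'),
    ('Roland', 'Mendel');

-- Forcing a Latin1_General collation on the column for this predicate
-- makes each underscore stand for exactly one letter, so "Ph" counts
-- as two characters regardless of the database default collation.
SELECT fname, lname
FROM @employee
WHERE fname COLLATE Latin1_General_100_CI_AI LIKE '__l%'
ORDER BY fname;
```

With this override Philip should no longer match, because 'l' is its fourth character.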

Your data:
declare @employee TABLE (
fname VARCHAR(20) NOT NULL
,lname VARCHAR(20) NOT NULL
);
INSERT INTO @employee(fname,lname) VALUES ('Helen','Bennett');
INSERT INTO @employee(fname,lname) VALUES ('Helvetius','Nagy');
INSERT INTO @employee(fname,lname) VALUES ('Palle','Ibsen');
INSERT INTO @employee(fname,lname) VALUES ('Philip','Cramer');
INSERT INTO @employee(fname,lname) VALUES ('Roland','Mendel');
For finding a specific character in a column, CHARINDEX or SUBSTRING would be much better. In addition, you must choose the collation (sorting rules, case and accent sensitivity) that best fits your data, such as SQL_Latin1_General_CP1_CS_AS.
SUBSTRING
select concat(fname,' ',lname) as fullname
FROM @employee e
-- apply the collation to the expression being compared
where SUBSTRING(fname, 3, 1) COLLATE SQL_Latin1_General_CP1_CS_AS = 'l'
ORDER BY e.fname
CHARINDEX
select concat(fname,' ',lname) as fullname
FROM @employee e
-- position of the first 'l' must be 3
where CHARINDEX('l', fname COLLATE SQL_Latin1_General_CP1_CS_AS) = 3
ORDER BY e.fname

Remove the last value after a comma in nvarchar SQL

I have a column (Col1) with nvarchar(50) values like this: '17031021__,0,0.1,1'. I want to use this value to update another column(Col2), but remove the last number after the last comma (ex: '17031021__,0,0.1'). I think I need something like this:
CREATE TABLE table1 (
Col1 nvarchar(50),
Col2 nvarchar(50)
);
UPDATE table1
SET Col1 = '17031021__,0,0.1,1'
Select reverse(stuff(reverse(Col1), 1, 1, '')) As Col2
This is not quite right. What is the easiest way to achieve this?

Something along the following will give you a head-start.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, tokens NVARCHAR(100));
INSERT INTO @tbl (tokens) VALUES
('17031021__,0,0.1,1');
-- DDL and sample data population, end
-- pos = number of characters before the last comma
SELECT *
, LEFT(tokens, pos) AS result
FROM @tbl
CROSS APPLY (SELECT LEN(tokens) - CHARINDEX(',', REVERSE(tokens))) AS t(pos);
And after you feel comfortable:
UPDATE @tbl
SET tokens = LEFT(tokens, LEN(tokens) - CHARINDEX(',', REVERSE(tokens)));
-- test
SELECT * FROM @tbl;

OK, the aforementioned solutions seemed like overkill, so I was able to find an easier solution that worked. Here is what I used generically:
UPDATE [dbo].[table1]
SET [Col2] = left (Col1, len(Col1) - charindex(',', reverse(Col1) + ','))
Very similar to the second solution above but with the ',' added at the end. This produced the desired result.
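One case worth checking before running such an update is a value with no comma at all. With the appended ',', CHARINDEX returns LEN + 1 for such a value, so the computed length becomes -1, which LEFT rejects. The sketch below uses the expression without the appended comma; the table variable and the second sample row are hypothetical.

```sql
-- Hypothetical sample data: one value with commas, one without.
DECLARE @table1 TABLE (Col1 NVARCHAR(50), Col2 NVARCHAR(50));
INSERT INTO @table1 (Col1) VALUES (N'17031021__,0,0.1,1'), (N'no_comma_here');

-- CHARINDEX returns 0 when there is no comma, so LEFT keeps the
-- whole string; otherwise everything from the last comma on is cut.
UPDATE @table1
SET Col2 = LEFT(Col1, LEN(Col1) - CHARINDEX(N',', REVERSE(Col1)));

SELECT Col1, Col2 FROM @table1;
-- Col2 is '17031021__,0,0.1' for the first row
-- and 'no_comma_here' for the second.
```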

SQL sort order in Japanese breaks when text includes non-Japanese characters

It seems that Japanese sorting "breaks" when the text contains non-Japanese text, even when forcing any possible collation in the sort part of the query.
I would like to know if this is a known phenomenon, and what a solution could be.
In the end I'm looking for a kana type insensitive, case sensitive sorting, while searching should be kana type insensitive and case insensitive.
Here is the test case:
I would assume from the script below, that I get the same results in both queries (the expected sort order is in the third column). Basically once I sort by the complete word, and once I sort manually by the first letter, then the second and then third letter.
Given the DB collation SQL_Latin1_General_CP1_CI_AS
declare @temp as table (title nvarchar(5), expected int, script varchar(40) )
set nocount on
INSERT INTO @temp values(N'かか7', 4,'hiragana no accent')
INSERT INTO @temp values(N'がが6',7,'hiragana with accent')
INSERT INTO @temp values(N'いい5',1,'earlier letter hiragana no accent')
INSERT INTO @temp values(N'カカ4',3, 'katakana no accent')
INSERT INTO @temp values(N'ガガ3',6, 'katakana with accent')
INSERT INTO @temp values(N'かか2',2, 'hiragana no accent')
INSERT INTO @temp values(N'がが1', 5, 'hiragana with accent')
--BAD
select unicode(left(title,1)) 'bin', * from @temp order by title
--GOOD
select unicode(left(title,1)) 'bin', * from @temp order by left(title,1),substring(title,2,1), substring(title,3,1)
However, only the second version works; the first one doesn't sort correctly.
It seems it has to do with the numbers in the title field, since when I remove them, I do get the same order.
declare @temp as table (title nvarchar(5), expected int, script varchar(40) )
set nocount on
INSERT INTO @temp values(N'かか', 2,'hiragana no accent')
INSERT INTO @temp values(N'がが',3,'hiragana with accent')
INSERT INTO @temp values(N'いい',1,'earlier letter hiragana no accent')
INSERT INTO @temp values(N'カカ',2, 'katakana no accent')
INSERT INTO @temp values(N'ガガ',3, 'katakana with accent')
INSERT INTO @temp values(N'かか',2, 'hiragana no accent')
INSERT INTO @temp values(N'がが', 3, 'hiragana with accent')
--GOOD
select unicode(left(title,1)) 'bin', * from @temp order by title
--GOOD
select unicode(left(title,1)) 'bin', * from @temp order by left(title,1),substring(title,2,1)
Does anybody have a clue why, and possibly a solution?

Brute-force approach: Checking all supported collations in SQL Server:
create table ##temp(title nvarchar(5), expected int, script varchar(40) );
INSERT INTO ##temp values(N'かか7', 4,'hiragana no accent');
INSERT INTO ##temp values(N'がが6',7,'hiragana with accent');
INSERT INTO ##temp values(N'いい5',1,'earlier letter hiragana no accent');
INSERT INTO ##temp values(N'カカ4',3, 'katakana no accent');
INSERT INTO ##temp values(N'ガガ3',6, 'katakana with accent');
INSERT INTO ##temp values(N'かか2',2, 'hiragana no accent');
INSERT INTO ##temp values(N'がが1', 5, 'hiragana with accent');
And script:
CREATE TABLE result(collation_name NVARCHAR(1000));
DECLARE @collate_name NVARCHAR(1000);
DECLARE @sql NVARCHAR(MAX);
DECLARE c CURSOR FOR
SELECT name FROM sys.fn_helpcollations() /* where name LIKE '%japan%'*/;
OPEN c;
FETCH NEXT FROM c INTO @collate_name;
WHILE @@FETCH_STATUS = 0
BEGIN
-- build one query per collation; it returns a row only if all 7 rows sort as expected
SET @sql = REPLACE(
N'with cte as (
select bin = unicode(left(title,1)),expected
,rn= row_number() over(order by title collate <collate>)
,collation = ''<collate>''
from ##temp
)
select collation
from cte
where expected = rn GROUP BY collation HAVING COUNT(*) = 7'
, '<collate>', @collate_name);
-- debug
--PRINT @sql;
INSERT INTO result(collation_name) EXEC (@sql);
FETCH NEXT FROM c INTO @collate_name;
END
SELECT * FROM result;
CLOSE c;
DEALLOCATE c;
Result: There is no collation in SQL Server 2017 that will match "expected order".

Looks like your expected order is invalid for Japanese Collation.
select case when N'か' COLLATE Japanese_BIN2 < N'カ' COLLATE Japanese_BIN2 then 'True' else 'False' end
Sorting seems fine otherwise.
--CORRECT
select unicode(left(title,1)) 'bin', * from @temp order by title COLLATE Japanese_BIN2
--CORRECT
select unicode(left(title,1)) 'bin', substring(title,2,1), * from @temp order by left(title,2) COLLATE Japanese_BIN2, substring(title,3,1)
Both generate:
bin title expected script
12356 いい5 1 earlier letter hiragana no accent
12363 かか2 2 hiragana no accent
12363 かか7 4 hiragana no accent
12364 がが1 5 hiragana with accent
12364 がが6 7 hiragana with accent
12459 カカ4 3 katakana no accent
12460 ガガ3 6 katakana with accent
Edit (more):
declare @temp as table (title nvarchar(5) COLLATE Japanese_BIN2, script varchar(40) )
set nocount on
INSERT INTO @temp values(N'が', 'hiragana with accent')
INSERT INTO @temp values(N'い', 'earlier letter hiragana no accent')
INSERT INTO @temp values(N'カ', 'katakana no accent')
INSERT INTO @temp values(N'ガ', 'katakana with accent')
INSERT INTO @temp values(N'か', 'hiragana no accent')
select * from @temp order by title
title script
い earlier letter hiragana no accent
か hiragana no accent
が hiragana with accent
カ katakana no accent
ガ katakana with accent
With the collation set, adding on a trailing number becomes predictable.
declare @temp as table (title nvarchar(5) COLLATE Japanese_BIN2, script varchar(40) )
set nocount on
INSERT INTO @temp values(N'かか7','hiragana no accent')
INSERT INTO @temp values(N'がが6','hiragana with accent')
INSERT INTO @temp values(N'いい5','earlier letter hiragana no accent')
INSERT INTO @temp values(N'カカ4','katakana no accent')
INSERT INTO @temp values(N'ガガ3','katakana with accent')
INSERT INTO @temp values(N'かか2','hiragana no accent')
INSERT INTO @temp values(N'がが1','hiragana with accent')
select * from @temp order by title
title script
いい5 earlier letter hiragana no accent
かか2 hiragana no accent
かか7 hiragana no accent
がが1 hiragana with accent
がが6 hiragana with accent
カカ4 katakana no accent
ガガ3 katakana with accent
Edit (CI_AI)
Case Insensitive / Accent Insensitive Results with numeric appended at the end.
declare @temp as table (title nvarchar(5) COLLATE Japanese_90_CI_AI, script varchar(40) )
set nocount on
INSERT INTO @temp values(N'かか7','hiragana no accent')
INSERT INTO @temp values(N'がが6','hiragana with accent')
INSERT INTO @temp values(N'いい5','earlier letter hiragana no accent')
INSERT INTO @temp values(N'カカ4','katakana no accent')
INSERT INTO @temp values(N'ガガ3','katakana with accent')
INSERT INTO @temp values(N'かか2','hiragana no accent')
INSERT INTO @temp values(N'がが1','hiragana with accent')
select * from @temp order by title
Hiragana and katakana, although different scripts, become interchangeable, and only the trailing numbers determine the order.
title script
いい5 earlier letter hiragana no accent
がが1 hiragana with accent
かか2 hiragana no accent
ガガ3 katakana with accent
カカ4 katakana no accent
がが6 hiragana with accent
かか7 hiragana no accent
Edit (3)
The collation in the table definition will affect search results.
You can also apply a different collation at sort time, overriding the one in the table definition.
Japanese_90_CI_AI: behaves (ignores kana, ignores accent)
Japanese_90_CI_AS: behaves (ignores kana, sorts accents)
Japanese_90_CS_AI: doesn't affect kana-type
Japanese_90_CI_AI_KS : behaves (ignores accent, sorts kana)
Japanese_BIN2: behaves (sorts kana, sorts accent)
Swap out the COLLATE below to get a better view of what to expect
declare @temp as table (title nvarchar(5) COLLATE Japanese_90_CI_AS, script varchar(40) )
set nocount on
INSERT INTO @temp values(N'い','earlier letter hiragana no accent')
INSERT INTO @temp values(N'か','hiragana no accent')
INSERT INTO @temp values(N'カ','katakana no accent')
INSERT INTO @temp values(N'ガ','katakana with accent')
INSERT INTO @temp values(N'が','hiragana with accent');
-- compare every title with every other title under the column collation
select a.title, b.title, case when a.title = b.title then 'True' else 'False' end search_equivalent
from @temp a join @temp b on 1=1
Quoting the question: "In the end I'm looking for a kana type insensitive, case sensitive sorting, while searching should be kana type insensitive and case insensitive."
For searching (kana type insensitive and case insensitive):
declare @temp as table (title nvarchar(5) COLLATE Japanese_90_CI_AI, script varchar(40) )
For sorting:
-- Sorting is predictable using the collation above, but not what you originally expected.
order by title COLLATE Japanese_90_CS_AI_KS
More (Final - v1)
Looking at your desired output:
(treat kata as hira)
(differentiate accent)
(unicode order)
Since there's no COLLATION for that (see @Lukasz Szozda's answer),
Here's one possible workaround:
declare @temp as table (title nvarchar(5), expected int, script varchar(40))
set nocount on
INSERT INTO @temp values(N'かか7', 4,'hiragana no accent');
INSERT INTO @temp values(N'がが6',7,'hiragana with accent');
INSERT INTO @temp values(N'いい5',1,'earlier letter hiragana no accent');
INSERT INTO @temp values(N'カカ4',3, 'katakana no accent');
INSERT INTO @temp values(N'ガガ3',6, 'katakana with accent');
INSERT INTO @temp values(N'かか2',2, 'hiragana no accent');
INSERT INTO @temp values(N'がが1', 5, 'hiragana with accent');
select * from @temp order by dbo.kata2hira(title) COLLATE Latin1_General_BIN2;
title expected script
いい5 1 earlier letter hiragana no accent
かか2 2 hiragana no accent
カカ4 3 katakana no accent
かか7 4 hiragana no accent
がが1 5 hiragana with accent
ガガ3 6 katakana with accent
がが6 7 hiragana with accent
CREATE OR ALTER FUNCTION dbo.Kata2Hira (@kana NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING
AS
BEGIN
-- filler x filler yields 100 rows, enough to number 96 code points per block
WITH filler AS (
select value FROM STRING_SPLIT('1,1,1,1,1,1,1,1,1,1',',')
), Hira AS (
-- 3040 - 309F (Hiragana)
SELECT TOP (96) 0x3040 + ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS [N]
FROM filler x JOIN filler y on 1=1
), Kata AS (
-- 30A0 - 30FF (Katakana)
SELECT TOP (96) 0x30A0 + ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS [N]
FROM filler x JOIN filler y on 1=1
)
-- replace each katakana character with the hiragana character it equals under a kana-insensitive collation
SELECT @kana = REPLACE(@kana, NCHAR(Kata.[N]), NCHAR(Hira.[N]))
FROM Hira JOIN Kata ON NCHAR(Hira.[N]) COLLATE JAPANESE_CS_AS = NCHAR(Kata.[N]) COLLATE JAPANESE_CS_AS;
RETURN @kana;
END;
GO
Post workaround thoughts:
Just a misc FYI while I was checking things out, it doesn't appear unique to non-japanese characters.
Replace 2 and 1 with い, and あ, and the default collation behaves consistently.
-- Implied Kana Insensitive vs JAPANESE_CS_AS_KS
select case when N'カ' COLLATE JAPANESE_CS_AS < N'ガ' COLLATE JAPANESE_CS_AS then 'True' else 'False' end as want_true union all
select case when N'カ2' COLLATE JAPANESE_CS_AS < N'ガ1' COLLATE JAPANESE_CS_AS then 'True' else 'False' end union all
select case when N'カb' COLLATE JAPANESE_CS_AS < N'ガa' COLLATE JAPANESE_CS_AS then 'True' else 'False' end union all
select case when N'カい' COLLATE JAPANESE_CS_AS < N'ガあ' COLLATE JAPANESE_CS_AS then 'True' else 'False' end union all
select case when N'1カい' COLLATE JAPANESE_CS_AS < N'2ガあ' COLLATE JAPANESE_CS_AS then 'True' else 'False' end -- sanity check to make sure we're not reading right to left...haha
want_true
True
False
False
False
True
The why would be really interesting to know...
The plausible theory: accent sensitivity ordering is applied as a second pass after kana insensitivity over the field (instead of per character).

How to search case sensitive string in MS Sql server?

How can I search case-sensitive data such as user_name and password in MS SQL Server?
In MySQL it is done with the BINARY() function.

Create the column with a case-sensitive collation, and try this:
Query:
DECLARE @temp TABLE
(
Name VARCHAR(50) COLLATE Latin1_General_CS_AS
)
INSERT INTO @temp (Name)
VALUES
('Ankit Kumar'),
('DevArt'),
('Devart')
SELECT *
FROM @temp
WHERE Name LIKE '%Art'
Output:
DevArt
Or try this similar code, which applies the collation in the query instead of the column definition:
DECLARE @temp TABLE (Name NVARCHAR(50))
INSERT INTO @temp (Name)
VALUES
('Ankit Kumar'),
('DevArt'),
('Devart')
SELECT *
FROM @temp
WHERE Name LIKE '%Art' COLLATE Latin1_General_CS_AS

You can do this by casting to binary:
SELECT * FROM UsersTable
WHERE
CAST(Username as varbinary(50)) = CAST(@Username as varbinary(50))
AND CAST(Password as varbinary(50)) = CAST(@Password as varbinary(50))
AND Username = @Username
AND Password = @Password
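The same exact-match check can also be expressed with a binary collation instead of the varbinary casts. A minimal sketch, assuming a hypothetical @users table variable in place of UsersTable and made-up sample values:

```sql
-- Hypothetical table variable standing in for UsersTable.
DECLARE @users TABLE (Username NVARCHAR(50), Password NVARCHAR(50));
INSERT INTO @users (Username, Password) VALUES (N'DevArt', N'Secret1');

-- The supplied user name differs from the stored one only by case.
DECLARE @Username NVARCHAR(50) = N'devart';
DECLARE @Password NVARCHAR(50) = N'Secret1';

-- BIN2 compares code points, so 'devart' and 'DevArt' are different.
SELECT Username
FROM @users
WHERE Username = @Username COLLATE Latin1_General_BIN2
  AND Password = @Password COLLATE Latin1_General_BIN2;
```

As written, this returns no rows; setting @Username to N'DevArt' returns the stored row.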

contains search over a table variable or a temp table

I'm trying to concatenate several columns from a persistent table into one column of a table variable, so that I can run a contains("foo" and "bar") and get a result even if foo is not in the same column as bar.
However, it isn't possible to create a unique index on a table variable, hence no full-text index to run a contains against.
Is there a way to dynamically concatenate several columns and run a contains on them? Here's an example:
declare @t0 table
(
id uniqueidentifier not null,
search_text varchar(max)
)
declare @t1 table ( id uniqueidentifier )
insert into
@t0 (id, search_text)
select
id,
foo + bar
from
description_table
insert into
@t1
select
id
from
@t0
where
contains( search_text, '"c++*" AND "programming*"' )

You cannot use CONTAINS on a table that has not been configured to use Full Text Indexing, and that cannot be applied to table variables.
If you want to use CONTAINS (as opposed to the less flexible PATINDEX) you will need to base the whole query on a table with a FT index.

You can't use full text indexing on a table variable but you can apply the full text parser. Would something like this do what you need?
declare @d table
(
id int identity(1,1),
testing varchar(1000)
)
INSERT INTO @d VALUES ('c++ programming')
INSERT INTO @d VALUES ('c# programming')
INSERT INTO @d VALUES ('c++ books')
SELECT id
FROM @d
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(testing,'"','""') + '"', 1033, 0,0)
where display_term in ('c++','programming')
GROUP BY id
HAVING COUNT(DISTINCT display_term)=2
NB: There might well be a better way of using the parser, but I couldn't quite figure it out. Details are in the documentation for sys.dm_fts_parser.

declare @table table
(
id int,
fname varchar(50)
)
insert into @table select 1, 'Adam James Will'
insert into @table select 1, 'Jain William'
insert into @table select 1, 'Bob Rob James'
select * from @table where fname like '%ja%' and fname like '%wi%'
Is it something like this?
