SQL Server 2014 inconsistent behavior when handling Unicode

My goal is to enforce case-insensitive uniqueness on a column that may contain Unicode characters from any Unicode plane, but I encountered some inconsistent behavior when testing against SQL Server 2014 (build 12.0.4213.0). Here is the test:
CREATE TABLE unicodetest1 (
    id int,
    /* this collation is supposed to support all supplementary Unicode characters */
    a nvarchar(10) COLLATE Latin1_General_100_CI_AS_SC
);
CREATE UNIQUE INDEX idx_a_1 ON unicodetest1(a);
I can insert many single Unicode characters, both from the BMP and from the SMP (Plane 1):
INSERT INTO unicodetest1(id, a) VALUES (1, N't'); -- U+0074
INSERT INTO unicodetest1(id, a) VALUES (100, N'πŸ‘Έ'); -- U+1F478
INSERT INTO unicodetest1(id, a) VALUES (101, N'πŸ‘Ή'); -- U+1F479
INSERT INTO unicodetest1(id, a) VALUES (104, N'⭐'); -- U+2b50
SELECT id, a, len(a) FROM unicodetest1;
The length of each single-character string is 1, as expected.
When I concatenate some of these Unicode characters to ASCII strings, it still works:
INSERT INTO unicodetest1(id, a) VALUES (111, N'test'); -- ok
INSERT INTO unicodetest1(id, a) VALUES (112, N'testπŸ‘Έ'); -- ok
INSERT INTO unicodetest1(id, a) VALUES (113, N'testπŸ‘Ή'); -- ok
So far so good.
But, when I insert N'test⭐', I get a duplicate key violation:
INSERT INTO unicodetest1(id, a) VALUES (114, N'test⭐'); -- dup
Msg 2601, Level 14, State 1, Line 21
Cannot insert duplicate key row in object 'dbo.unicodetest1' with unique index 'idx_a_1'. The duplicate key value is (test⭐).
The character "⭐" has a code point U+2b50, it is on BMP!
It seems that SQL Server thinks N'test⭐' and N'test' are identical!
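The comparison can be checked directly under the same collation as the column (a minimal sketch; the 'equal'/'different' labels are just for display):
-- Compare the two values using the column's collation explicitly.
-- An 'equal' result matches what the unique index is doing here.
SELECT CASE
           WHEN N'test⭐' = N'test' COLLATE Latin1_General_100_CI_AS_SC
               THEN 'equal'
           ELSE 'different'
       END AS comparison_result;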
My question is: what makes "⭐" special, so that it violates the uniqueness constraint when it is concatenated to a string?
Thanks
James

Related

'String or binary data would be truncated' without any data exceeding the length

Yesterday a report suddenly came in that someone could no longer retrieve some data, because the following error appeared:
Msg 2628, Level 16, State 1, Line 57
String or binary data would be truncated in table 'tempdb.dbo.#BC6D141E', column 'string_2'. Truncated value: '!012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678'.
I was unable to create a repro without our tables. This is the closest as I can get to:
-- Table variable for the results
DECLARE @results TABLE (
    string_1 nvarchar(100) NOT NULL,
    string_2 nvarchar(100) NOT NULL
);
CREATE TABLE #table (
    T_ID BIGINT NULL,
    T_STRING NVARCHAR(1000) NOT NULL
);
INSERT INTO #table VALUES
(NULL, '0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789'),
(NULL, '!0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789!');
WITH abc AS
(
    SELECT
        '' AS STRING_1,
        t.T_STRING AS STRING_2
    FROM
        UT
        INNER JOIN UTT ON UTT.UT_ID = UT.UT_ID
        INNER JOIN MV ON MV.UTT_ID = UTT.UTT_ID
        INNER JOIN OT ON OT.OT_ID = MV.OT_ID
        INNER JOIN #table AS T ON T.T_ID = OT.T_ID -- this will never get hit because T_ID of #table is NULL
)
INSERT INTO @results
SELECT STRING_1, STRING_2 FROM abc
ORDER BY LEN(STRING_2) DESC
DROP TABLE #table;
As you can see, the join on #table cannot yield any results because every T_ID is NULL; nevertheless, I get the error mentioned above. The result set is empty.
That would be understandable if a text with more than 100 characters were in the result set, but that is not the case, because it is empty. If I remove the INSERT INTO @results and just display the results, they do not contain any text with more than 100 characters. The ORDER BY was only used to determine the faulty text value (with the original data).
When I use SELECT STRING_1, LEFT(STRING_2, 100) FROM abc it does work, but the result does not contain the text that is supposedly being truncated either.
Therefore: what am I missing? Is it a bug in SQL Server?
"this will never get hit" is a bad assumption. It is well known and documented that SQL Server may try to evaluate parts of your query before it is obvious that the result is impossible.
A much simpler repro (from this post and this db<>fiddle):
CREATE TABLE dbo.t1(id int NOT NULL, s varchar(5) NOT NULL);
CREATE TABLE dbo.t2(id int NOT NULL);
INSERT dbo.t1 (id, s) VALUES (1, 'l=3'), (2, 'len=5'), (3, 'l=3');
INSERT dbo.t2 (id) VALUES (1), (3), (4), (5);
GO
DECLARE #t table(dest varchar(3) NOT NULL);
INSERT #t(dest) SELECT t1.s
FROM dbo.t1
INNER JOIN dbo.t2 ON t1.id = t2.id;
Result:
Msg 2628, Level 16, State 1
String or binary data would be truncated in table 'tempdb.dbo.#AC65D70E', column 'dest'. Truncated value: 'len'.
While we should only have retrieved rows with values that fit in the destination column (id is 1 or 3, since those are the only two rows that match the join criteria), the error message indicates that the row where id is 2 was also evaluated, even though we know it could not possibly have appeared in the result.
The estimated plan shows that SQL Server expected to convert all of the values in t1 before the filter eliminated the longer ones. It is very difficult to predict or control when SQL Server will process your query in an order you don't expect. You can try query hints that attempt to force the join order or to steer away from hash joins, but those can cause other, more severe problems later.
The best fix is to size the destination column to match the source (in other words, make it large enough to hold any value from the source). The blog post and db<>fiddle explain some other ways to work around the issue, but declaring the columns wide enough is the simplest and least intrusive option.
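Applied to the simpler repro above, that fix is just to widen the table variable's column so it matches the source column dbo.t1.s (a minimal sketch):
DECLARE @t table(dest varchar(5) NOT NULL);  -- sized to match dbo.t1.s
INSERT @t(dest) SELECT t1.s
FROM dbo.t1
INNER JOIN dbo.t2 ON t1.id = t2.id;          -- no truncation error, regardless of plan shape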

How to make SQL server recognize unique Thai characters?

Here's my table:
id name
1 αž‘αžΉαž€ αž€αžΆαž”αžΌαž“
2 αž›αžΈαž’αžΌ αž”αŸ€αžš
3 αžŸαŸ’αžšαž”αŸ€αžš ្ៀ
4 αžŸαŸ’αžšαžΆαž”αŸ€αžš αžŒαŸ’αžšαžΆαž”αŸ‹
When I query using this statement: SELECT * FROM t1 WHERE name = N'αž›αžΈαž’αžΌ αž”αŸ€αžš', it returns all rows. This is weird since those Thai characters are different.
Also, this will cause an issue if I make the name column unique. Has anyone encountered the same issue, and what is a possible fix? I tried changing the collation, but to no avail.
Bing Translate identifies those characters as Khmer, not Thai. So you need to pick a collation that has language-specific rules for those characters, e.g.
drop table if exists t1
create table t1(id int, name nvarchar(200) collate Khmer_100_CI_AI )
insert into t1(id,name)
values (1, N'αž‘αžΉαž€ αž€αžΆαž”αžΌαž“'),(2, N'αž›αžΈαž’αžΌ αž”αŸ€αžš'),(3, N'αžŸαŸ’αžšαž”αŸ€αžš ្ៀ'),(4, N'αžŸαŸ’αžšαžΆαž”αŸ€αžš αžŒαŸ’αžšαžΆαž”αŸ‹')
SELECT * FROM t1 WHERE name = N'αž›αžΈαž’αžΌ αž”αŸ€αžš'
Or use a binary collation, which simply compares characters by their code point values, e.g.
drop table if exists t1
create table t1(id int, name nvarchar(200) collate Latin1_General_100_BIN2 )
insert into t1(id,name)
values (1, N'αž‘αžΉαž€ αž€αžΆαž”αžΌαž“'),(2, N'αž›αžΈαž’αžΌ αž”αŸ€αžš'),(3, N'αžŸαŸ’αžšαž”αŸ€αžš ្ៀ'),(4, N'αžŸαŸ’αžšαžΆαž”αŸ€αžš αžŒαŸ’αžšαžΆαž”αŸ‹')
SELECT * FROM t1 WHERE name = N'αž›αžΈαž’αžΌ αž”αŸ€αžš'
Even some of the newer Latin collations will work, e.g.
drop table if exists t1
create table t1(id int, name nvarchar(200) collate Latin1_General_100_CI_AI )
insert into t1(id,name)
values (1, N'αž‘αžΉαž€ αž€αžΆαž”αžΌαž“'),(2, N'αž›αžΈαž’αžΌ αž”αŸ€αžš'),(3, N'αžŸαŸ’αžšαž”αŸ€αžš ្ៀ'),(4, N'αžŸαŸ’αžšαžΆαž”αŸ€αžš αžŒαŸ’αžšαžΆαž”αŸ‹')
SELECT * FROM t1 WHERE name = N'αž›αžΈαž’αžΌ αž”αŸ€αžš'
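If the column also needs to be unique, as mentioned in the question, the same collation choice carries over to the constraint. A sketch, reusing the t1 table from any of the variants above (the index name ix_t1_name is made up):
-- Duplicates are judged by the column's collation. With any of the collations
-- shown above, the four names are distinct, so this succeeds; under a collation
-- that cannot tell the Khmer strings apart, the same statement would fail with
-- a duplicate key error.
CREATE UNIQUE INDEX ix_t1_name ON t1(name)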

t-sql insert statement where primary key is a decimal

Two years ago I created a table that has 22 rows. Each row is a step/page in filing an application for hire. I realized back then that I would most likely be asked to insert steps as the business grew, and I was right: I now need to insert a new step between steps 21 and 22. So I want to create a new row in that table with stepId = 21.5, but the insert statement fails.
INSERT INTO frznStep (
stepId
,myField1
,myField2
,myField3
)
VALUES (
21.5
,'xxx'
,'yyy'
,'zzz'
)
The error message is:
Violation of PRIMARY KEY constraint 'PK_frznStep'. Cannot insert
duplicate key in object 'dbo.frznStep'. The duplicate key value is
(22).
I suspect that when you script out the table, you'll see that the scale of your decimal column is 0, i.e. something like stepId decimal(9,0).
If the column has a non-zero scale, the following repro works:
USE tempdb
DROP TABLE IF EXISTS #frznStep;
CREATE TABLE #frznStep
(
stepId decimal(9, 1) NOT NULL
, myField1 varchar(255) NOT NULL
, myField2 varchar(255) NOT NULL
, myField3 varchar(255) NOT NULL
, CONSTRAINT PK_frznStep PRIMARY KEY (stepId)
);
insert into #frznStep (stepId, myField1, myField2, myField3) values (21, 'www', 'yyy', 'zzz');
insert into #frznStep (stepId, myField1, myField2, myField3) values (22, 'yyy', 'yyy', 'zzz');
insert into #frznStep (stepId, myField1, myField2, myField3) values (21.5, 'xxx', 'yyy', 'zzz');
GO
When you use a scale of 0, you'll get 21 and 22 into the table, but 21.5 will be implicitly converted to decimal(x,0), rounded to 22, and will then violate the primary key constraint.
-- Redeclare with scale 0
DROP TABLE IF EXISTS #frznStep;
CREATE TABLE #frznStep
(
stepId decimal(9, 0) NOT NULL
, myField1 varchar(255) NOT NULL
, myField2 varchar(255) NOT NULL
, myField3 varchar(255) NOT NULL
, CONSTRAINT PK_frznStep PRIMARY KEY (stepId)
);
insert into #frznStep (stepId, myField1, myField2, myField3) values (21, 'www', 'yyy', 'zzz');
insert into #frznStep (stepId, myField1, myField2, myField3) values (22, 'yyy', 'yyy', 'zzz');
-- The next insert fails with:
--Msg 2627, Level 14, State 1, Line 36
--Violation of PRIMARY KEY constraint 'PK_frznStep'. Cannot insert duplicate key in object 'dbo.#frznStep'. The duplicate key value is (22).
--The statement has been terminated.
insert into #frznStep (stepId, myField1, myField2, myField3) values (21.5, 'xxx', 'yyy', 'zzz');
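To see the implicit conversion on its own, independent of the table (a minimal check):
-- 21.5 converted to a decimal with scale 0 rounds to 22, which is why the
-- insert collides with the existing key value.
SELECT CAST(21.5 AS decimal(9, 0)) AS converted_value;  -- returns 22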
Your options are either to change the data type to include a scale (which will require dropping and recreating the primary key, since the column is part of it), or to scale everything up by a factor of 10, so you can slot 215 nicely between 210 and 220. (A "trick" I learned the hard way programming Apple BASIC ages ago.)
The first intuitive answer off the bat would be to convert your primary key to a numeric type, such as decimal.
However, is there really a reason to think of the step as 21.5? Or are you just trying to fit it between 21 and 22? I ask because the more ideal design would be a primary key that simply serves as an identity, plus a separate column that identifies the step number. That way, instead of making the new step 21.5, you simply make it step 22 and change the current step 22 to step 23.
alter table frznStep add stepOrd int null;  -- ALTER TABLE ... ADD takes the column definition directly (no COLUMN keyword in T-SQL)
update frznStep set stepOrd = stepId;
update frznStep set stepOrd = 23 where stepOrd = 22;
insert frznStep (stepId, stepOrd, ...) values (100, 22, ...);
You could also convert stepId to an identity (autoincrement) column, though I believe you would have to drop and recreate the table in that case.
You're getting the error because your id column is effectively an integer, and the value you are trying to insert is being rounded to an integer, thus colliding with an existing key value.
Rather than using the id column as both the unique identifier and the step order, which is a design flaw (overloading a column), model the steps as a chain, like a linked list, by introducing a column, perhaps called nextStepId, that stores the id of the next step to run.
This separates the concerns: the primary key remains purely a row identifier, and the step order can be controlled independently, without the id values having to sit in any particular order relative to each other.
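A minimal sketch of that idea, assuming the hypothetical nextStepId column, that the existing step ids are 1 through 22, and that 23 is free (none of these specifics come from the original table):
-- Linked-list layout: each step stores the id of the step that runs after it.
alter table frznStep add nextStepId decimal(9, 0) null;

-- Insert the new step with any free id; the id no longer encodes the order.
insert into frznStep (stepId, nextStepId, myField1, myField2, myField3)
values (23, 22, 'new step', 'yyy', 'zzz');  -- the new step points at step 22

-- Re-point step 21 at the new step instead of step 22.
update frznStep set nextStepId = 23 where stepId = 21;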

Trailing spaces allowed in foreign keys

Issue: SQL Server allows trailing spaces to be added to a foreign key!
This behaviour of course leads to various unwanted behaviour in the application. How can this be stopped?
Example: Two tables in a 1:n relationship:
create table products
(
    pid nvarchar(20) primary key
);
create table sales
(
    pid nvarchar(20) references products(pid),
    units int
);
Now insert primary key 'A':
insert into products (pid) values ('A');
Now insert foreign keys:
-- 'A' is accepted, as expected:
insert into sales (pid, units) values ('A', 23);
-- 'B' is declined, as expected:
insert into sales (pid, units) values ('B', 12);
-- 'A ' (with a trailing space)
-- This is ACCEPTED, but of course this is NOT EXPECTED !!
insert into sales (pid, units) values ('A ', 12);
A second issue is that this is really hard to detect, since:
select pid from sales group by pid
returns only one value, A, in your example.
Here is a trick to help detect the issue (grouping by a binary image of the column, so trailing spaces are no longer ignored):
select min(pid) as pid from sales group by cast(pid as varbinary(40))
This returns 2 rows: A and A (with a trailing space).
Cheers,
If you just plain don't want to allow trailing spaces:
create table sales
(
pid nvarchar(20) references products(pid),
units int,
constraint CK_sales_pid CHECK (RIGHT(pid,1) <> ' ')
);
Otherwise, you need to realise that this is not just a single "unexpected" situation. The SQL standard says that when two strings of unequal length are compared, the shorter string is first padded with spaces to make the lengths equal before the comparison occurs.
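A quick way to see that padding rule in action (a minimal sketch; the same rule is what lets the space-padded foreign key value match):
-- The shorter string is blank-padded before comparing, so these are 'equal'.
SELECT CASE WHEN 'A' = 'A ' THEN 'equal' ELSE 'different' END AS comparison_result;

-- DATALENGTH still sees the difference, which makes it useful for detection.
SELECT DATALENGTH('A') AS len_a, DATALENGTH('A ') AS len_a_space;  -- 1 and 2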

Does Sybase support string types that it doesn't right trim?

The accepted answer to this question claims that using a char column will preserve all filling spaces in a string in Sybase, but that is not the behavior I am seeing (unless I'm mistaken about what 'filling spaces' means). For example, when I run the following script:
create table #check_strings
(
    value char(20) null
)
insert into #check_strings values (null)
insert into #check_strings values ('')
insert into #check_strings values (' ')
insert into #check_strings values ('   ')         -- several spaces
insert into #check_strings values ('blah')
insert into #check_strings values ('blah ')
insert into #check_strings values ('blah    ')    -- several trailing spaces
insert into #check_strings values ('    blah ')   -- four leading spaces plus a trailing space
select value, len(value) from #check_strings
I get the following output:
[NULL] - [NULL]
[single space] - 1
[single space] - 1
[single space] - 1
blah - 4
blah - 4
blah - 4
[four spaces]blah - 8
Is there any way to get Sybase to not strip off trailing spaces? Additionally, is there any way to get it to save an empty string as an empty string, and not as a single space?
I'm using Sybase 15.5.
To get fixed-length behaviour from char columns, you must declare them with the not null attribute.
For variable-length columns, use varchar, or char with a null attribute (which Sybase stores as variable length).
Then, to measure the real stored data size, use the datalength function, not the len function nor the char_length function.
See http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.ase_15.0.sqlug/html/sqlug/sqlug208.htm
"When you create a char, unichar, or nchar column that allows nulls, Adaptive Server converts it to a varchar, univarchar, or nvarchar column and uses the storage rules for those datatypes."
In short: declare the column as not null.
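A minimal sketch of the difference on ASE 15.5 as described above (the table name is made up):
create table #check_fixed
(
    value char(20) not null   -- not null keeps char truly fixed-length
)
insert into #check_fixed values ('blah ')
insert into #check_fixed values ('')
-- datalength reports the stored size; for a char(20) not null column it should
-- be 20 for every row, trailing padding included.
select value, datalength(value) from #check_fixed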
