Unicode characters causing issues in SQL Server 2005 string comparison - sql-server

This query:
select *
from op.tag
where tag = 'fussball'
Returns a result which has a tag column value of "fußball". Column "tag" is defined as nvarchar(150).
While I understand they are similar words grammatically, can anyone explain and defend this behavior? I assume it is related to the same collation settings which allow you to change case sensitivity on a column/table, but who would want this behavior? A unique constraint on the column also causes failure on inserts of one value when the other exists due to a constraint violation. How do I turn this off?
Follow-up bonus point question. Explain why this query does not return any rows:
select 1
where 'fußball' = 'fussball'
Bonus question (answer?): @ScottCher pointed out to me privately that this is due to the string literal "fussball" being treated as a varchar. This query DOES return a result:
select 1
where 'fußball' = cast('fussball' as nvarchar)
But then again, this one does not:
select 1
where cast('fußball' as varchar) = cast('fussball' as varchar)
I'm confused.

I guess the Unicode collation set for your connection/table/database specifies that ss == ß. The latter behavior (the literal comparison returning no rows) would be because it's on a faulty fast path, or maybe it does a binary comparison, or maybe you're not passing in the ß in the right encoding (I agree it's stupid).
http://unicode.org/reports/tr10/#Searching mentions that U+00DF is special-cased. Here's an insightful excerpt:
Language-sensitive searching and matching are closely related to collation. Strings that compare as equal at some strength level are those that should be matched when doing language-sensitive matching. For example, at a primary strength, "ß" would match against "ss" according to the UCA, and "aa" would match "å" in a Danish tailoring of the UCA.

The SELECT does return a row with collation Latin1_General_CI_AS (SQL2000).
It does not with collation Latin1_General_BIN.
You can assign a table column a collation by using the COLLATE <collation> clause after NVARCHAR/VARCHAR.
You can also compare strings with a specific collation using the syntax
string1 = string2 COLLATE <collation>
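A quick sketch of both forms, reusing the collations mentioned above (the table name here is illustrative, not from the question):
-- Comparing with an explicit collation
SELECT 1 WHERE N'fußball' = N'fussball' COLLATE Latin1_General_CI_AS  -- returns 1: ß compares equal to ss
SELECT 1 WHERE N'fußball' = N'fussball' COLLATE Latin1_General_BIN    -- returns nothing: binary comparison
-- Assigning a collation to a column
CREATE TABLE dbo.tag_example (
    tag nvarchar(150) COLLATE Latin1_General_BIN NOT NULL
)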

This isn't an answer that explains behavior, but may be relevant:
In this question, I learned that using the collation of
Latin1_General_Bin
will avoid most collation quirks.

Some helper answers - not the complete answer to your question, but still maybe helpful:
If you try:
SELECT 1 WHERE N'fußball' = N'fussball'
you'll get "1" - when using the "N" to signify Unicode, the two strings are considered the same - why that's the case, I don't know (yet).
To find the default collation for a server, use
SELECT SERVERPROPERTY('Collation')
To find the collation of a given column in a database, use this query:
SELECT
    name AS 'Column Name',
    OBJECT_NAME(object_id) AS 'Table Name',
    collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID('your-table-name')
  AND name = 'your-column-name'
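If you also need the database-level default, DATABASEPROPERTYEX can report it (the database name below is a placeholder):
SELECT DATABASEPROPERTYEX('your-database-name', 'Collation') AS DatabaseCollation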

Bonus question (answer?): @ScottCher pointed out to me privately that this is due to the string literal "fussball" being treated as a varchar. This query DOES return a result:
select 1 where 'fußball' = cast('fussball' as nvarchar)
Here you're dealing with the SQL Server data type precedence rules, as stated in Data Type Precedence. Comparisons are always done using the higher-precedence type:
When an operator combines two expressions of different data types, the rules for data type precedence specify that the data type with the lower precedence is converted to the data type with the higher precedence.
Since nvarchar has a higher precedence than varchar, the comparison in your example will occur using the nvarchar type, so it's really exactly the same as select 1 where N'fußball' = N'fussball' (i.e. using Unicode types). I hope this also makes it clear why your last case doesn't return any row.
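To recap the three cases as a sketch (results as reported in the question; the exact outcome of the pure-varchar comparison depends on the database's default collation):
select 1 where 'fußball' = cast('fussball' as nvarchar)                  -- varchar literal promoted to nvarchar: row returned
select 1 where N'fußball' = N'fussball'                                  -- both sides nvarchar: row returned
select 1 where cast('fußball' as varchar) = cast('fussball' as varchar)  -- both sides varchar: no row in the question's environment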

Related

SQL Server returns wrong result with trailing spaces in Where clause [duplicate]

In SQL Server 2008 I have a table called Zone with a column ZoneReference varchar(50) not null as the primary key.
If I run the following query:
select '"' + ZoneReference + '"' as QuotedZoneReference
from Zone
where ZoneReference = 'WF11XU'
I get the following result:
"WF11XU "
Note the trailing space.
How is this possible? If the trailing space really is there on that row, then I'd expect to return zero results, so I'm assuming it's something else that SQL Server Management Studio is displaying weirdly.
In C# code calling zoneReference.Trim() removes it, suggesting it is some sort of whitespace character.
Can anyone help?
That's the expected result: in SQL Server the = operator ignores trailing spaces when making the comparison.
SQL Server follows the ANSI/ISO SQL-92 specification (Section 8.2, General rules #3) on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them. The padding directly affects the semantics of WHERE and HAVING clause predicates and other Transact-SQL string comparisons. For example, Transact-SQL considers the strings 'abc' and 'abc ' to be equivalent for most comparison operations.
The only exception to this rule is the LIKE predicate. When the right side of a LIKE predicate expression features a value with a trailing space, SQL Server does not pad the two values to the same length before the comparison occurs. Because the purpose of the LIKE predicate, by definition, is to facilitate pattern searches rather than simple string equality tests, this does not violate the section of the ANSI SQL-92 specification mentioned earlier.
Source
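A minimal sketch of that difference:
SELECT 1 WHERE 'abc' = 'abc '     -- returns 1: '=' pads the shorter string before comparing
SELECT 1 WHERE 'abc' LIKE 'abc '  -- returns nothing: LIKE does not pad when the pattern has a trailing space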
Trailing spaces are not always ignored.
I experienced this issue today. My table had NCHAR columns and was being joined to VARCHAR data.
Because the data in the table was not as wide as its field, trailing spaces were automatically added by SQL Server.
I had an ITVF (inline table-valued function) that took varchar parameters.
The parameters were used in a JOIN to the table with the NCHAR fields.
The joins failed because the data passed to the function did not have trailing spaces but the data in the table did. Why was that?
I was getting tripped up on DATA TYPE PRECEDENCE. (See http://technet.microsoft.com/en-us/library/ms190309.aspx)
When comparing strings of different types, the lower precedence type is converted to the higher precedence type before the comparison. So my VARCHAR parameters were converted to NCHARs. The NCHARs were compared, and apparently the spaces were significant.
How did I fix this? I changed the function definition to use NVARCHAR parameters, which are of a higher precedence than NCHAR. Now the NCHARs were changed automatically by SQL Server into NVARCHARs and the trailing spaces were ignored.
Why didn't I just perform an RTRIM? Testing revealed that RTRIM killed the performance, preventing the JOIN optimizations that SQL Server would have otherwise used.
Why not change the data type of the table? The tables are already installed on customer sites, and they do not want to run maintenance scripts (time + money to pay DBAs) or give us access to their machines (understandable).
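A sketch of the shape of that fix, with hypothetical object names (not the ones from the original system):
-- Hypothetical inline TVF: taking nvarchar parameters means the nchar column is
-- converted up to nvarchar in the join, per data type precedence.
CREATE FUNCTION dbo.GetRowsByCode (@code nvarchar(8))
RETURNS TABLE
AS
RETURN
    SELECT t.Id, t.Code
    FROM dbo.SomeTable AS t      -- t.Code is nchar(8) in this scenario
    WHERE t.Code = @code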
Yeah, Mark is correct. Run the following SQL:
create table #temp (name varchar(15))
insert into #temp values ('james ')
select '"' + name + '"' from #temp where name ='james'
select '"' + name + '"' from #temp where name like 'james'
drop table #temp
But, the assertion about the 'like' statement appears not to work in the above example. Output:
(1 row(s) affected)
-----------------
"james "
(1 row(s) affected)
-----------------
"james "
(1 row(s) affected)
EDIT:
To get it to work, you could put at the end:
and name <> rtrim(ltrim(name))
Ugly though.
EDIT2:
Given the comments above, the following would work:
select '"' + name + '"' from #temp where 'james' like name
try
select REPLACE('"' + ZoneReference + '"', ' ', '') as QuotedZoneReference from Zone where ZoneReference = 'WF11XU'

Does Snowflake support case insensitive where clause filter similar to SQL Server

I am migrating SQL code to Snowflake, and during the migration I found that by default Snowflake compares varchar fields case-sensitively (e.g. select 1 where 'Hello' = 'hello' does not match). To solve this problem I set the collation 'en-ci' at the account level. However, now I am not able to use crucial functions like REPLACE.
Is it possible in Snowflake to do a case-insensitive varchar comparison (without mentioning the collation explicitly or using the UPPER function every time) and still use the REPLACE function?
I will appreciate any help.
Thanks,
you can compare the text with a case-insensitive match via ILIKE
select 1 where 'Hello' ilike 'hello'
regexp_replace is your friend; it allows an 'i' parameter that stands for "ignore case":
https://docs.snowflake.com/en/sql-reference/functions/regexp_replace.html
For example you can do something like that:
select regexp_replace('cats are grey, cAts are Cats','cats','dogs',1,0,'i');
I've assumed the default values for position and occurrence but those can also be adjusted
And you can still do comparison (also based on regexp, aka "RLIKE"):
https://docs.snowflake.com/en/sql-reference/functions/rlike.html
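For example, a sketch using the RLIKE function form with the 'i' parameter (note that regex metacharacters in the value would need escaping):
select 1 where rlike('Hello', 'hello', 'i')
-- or as an expression:
select rlike('Hello', 'hello', 'i') as is_match  -- TRUE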
Snowflake supports COLLATE:
SELECT 1 WHERE 'Hello' = 'hello' COLLATE 'en-ci';
-- 1
SELECT 'Hello' = 'hello'
,'Hello' = 'hello' COLLATE 'en-ci';
Output: FALSE for the plain comparison and TRUE for the collated one.
The collation can also be set up at the account/database/schema/table level with the DEFAULT_DDL_COLLATION parameter:
Sets the default collation used for the following DDL operations:
CREATE TABLE
ALTER TABLE … ADD COLUMN
Setting this parameter forces all subsequently-created columns in the affected objects (table, schema, database, or account) to have the specified collation as the default, unless the collation for the column is explicitly defined in the DDL.
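For instance, a sketch of setting it at the account or database level (object names illustrative; as the question notes, some functions such as REPLACE may still reject collated columns):
ALTER ACCOUNT SET DEFAULT_DDL_COLLATION = 'en-ci';
-- or, more narrowly:
ALTER DATABASE my_db SET DEFAULT_DDL_COLLATION = 'en-ci';
-- columns created afterwards default to the case-insensitive collation:
CREATE TABLE t1 (name VARCHAR);
SELECT 1 FROM t1 WHERE name = 'hello';  -- also matches rows containing 'Hello'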

attribute key was not found - MS SQL Server

I got the following message on MS SQL Server (I'm translating from German):
"Table 'VF_Fact', column ORGUNIT_CD, Value: 1185. The attribute is
ORGUNIT_CD. Row was dropped because attribute key was not found.
Attribute: ORGUNIT_CD in the dimension 'Organization' from database
'Dashboard', Cube 'Box Cube'..."
I checked the fact table 'VF_Fact' and the column ORGUNIT_CD - there I was able to find the value '1185'. The column ORGUNIT_CD is defined as follows in the view:
CAST( COALESCE( emp.ORGUNIT_CD, 99999999 ) AS char(8)) AS ORGUNIT_CD,
In addition the view retrieves the column from L_Employee_SAP TABLE, where ORGUNIT_CD is defined as follows:
[ORGUNIT_CD] [char](8) NOT NULL,
AND the value I find here is not '1185' but '00001185'.
The Fact table 'VF_Fact' is connected with the table L_ORG in which the column ORGUNIT_CD is defined as follows:
[ORGUNIT_CD] [char](8) NOT NULL,
This table has the following value in the ORGUNIT_CD column: '00001185'.
Can anyone please explain why I am getting this error, and how to remove it?
From this answer:
COALESCE:
Return Types
Returns the data type of expression with the highest data type precedence. If all expressions are nonnullable, the result is typed as nonnullable.
(Emphasis added.) int has a higher precedence than varchar, so the return type of your COALESCE must be of type int. And obviously, your varchar value cannot be so converted.
As another answer noted, ISNULL() behaves differently: rather than return the data type with the highest precedence, it returns the data type of the first value (thus, @Aleem's answer would solve your issue). A more detailed explanation can be found here under the section "Data Type of Expression."
In your specific case, I'd actually recommend that you encase the alternative string in single quotes, thus tipping SQL Server off to the fact that you intend this to be a character field. This means your expression would be one of the following:
CAST (ISNULL( emp.ORGUNIT_CD, '99999999' ) as char(8))
CAST (COALESCE( emp.ORGUNIT_CD, '99999999' ) AS char(8))
The advantage of using quotes in this situation? If you (or another developer) comes back to this down the line and tries to change it to COALESCE() or do any other type of modification, it's still going to work without breaking anything, because you told SQL Server what data type you want to use in the string itself. Depending on what else you're trying to do, you might even be able to remove the CAST() statement entirely.
COALESCE( emp.ORGUNIT_CD, '99999999' )
The COALESCE function is dropping the leading zeroes. If you are checking for nulls you can do this and it will keep the zeroes:
CAST (ISNULL( emp.ORGUNIT_CD, 99999999 ) as char(8))
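A small sketch showing both variants side by side (the value is just an example):
DECLARE @orgunit char(8)
SET @orgunit = '00001185'
-- COALESCE with a numeric fallback returns int (higher precedence), so the leading zeroes are lost:
SELECT CAST(COALESCE(@orgunit, 99999999) AS char(8))   -- '1185    '
-- Quoting the fallback, or using ISNULL, keeps the expression character-typed:
SELECT CAST(COALESCE(@orgunit, '99999999') AS char(8)) -- '00001185'
SELECT CAST(ISNULL(@orgunit, 99999999) AS char(8))     -- '00001185'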

Microsoft SQL Server collation names

Does anybody know what the WS property of a collation does? Does it have anything to do with Asian type of scripts? The MSDN docs explain it to be "Width Sensitive", but that doesn't make any sense for say Swedish, or English...?
A good description of width sensitivity is summarized here: http://www.databasejournal.com/features/mssql/article.php/3302341/SQL-Server-and-Collation.htm
Width sensitivity
When a single-byte character (half-width) and the same character when represented as a double-byte character (full-width) are treated differently then it is width sensitive.
Perhaps from an English character perspective, I would theorize that a width-sensitive collation would mean that 'abc' <> N'abc', because one string is a Unicode string (2 bytes per character), whereas the other is one byte per character.
From a Latin character set perspective it seems like something that wouldn't make sense to set. Perhaps in other languages this is important.
I try to set these types of collation properties to insensitive in general in order to avoid weird things like records not getting returned in search results. I usually keep accents set to insensitive, since that can cause a lot of user search headaches, depending on the audience of your applications.
Edit:
After creating a test database with the Latin1_General_CS_AS_WS collation, I found that N'a' = 'a' is actually true. Test queries were:
select case when 'a' = 'A' then 'yes' else 'no' end
select case when 'a' = 'a' then 'yes' else 'no' end
select case when N'a' = 'a' then 'yes' else 'no' end
So in practice I'm not sure where this type of rule comes into play.
The accepted answer demonstrates that it does not come into play for the comparison N'a' = 'a'. This is easily explained because the char will get implicitly converted to nchar in the comparison between the two anyway so both strings in the comparison are Unicode.
I just thought of an example of a place where width sensitivity might be expected to come into play in a Latin Collation only to discover that it appeared to make no difference at all there either...
DECLARE @T TABLE (
    a VARCHAR(2) COLLATE Latin1_General_100_CS_AS_WS,
    b VARCHAR(2) COLLATE Latin1_General_100_CS_AS_WS)

INSERT INTO @T
VALUES (N'Æ',
        N'AE');

SELECT LEN(a) AS [LEN(a)],
       LEN(b) AS [LEN(b)],
       a,
       b,
       CASE
           WHEN a = b THEN 'Y'
           ELSE 'N'
       END AS [a=b]
FROM @T
LEN(a)      LEN(b)      a    b    a=b
----------- ----------- ---- ---- ----
1           2           Æ    AE   Y
The Book "Microsoft SQL Server 2008 Internals" has this to say.
Width Sensitivity refers to East Asian languages for which there exists both half-width and full-width forms of some characters.
There is absolutely nothing stopping you storing these characters in a collation such as Latin1_General_100_CS_AS_WS as long as the column has a Unicode data type, so I guess that the WS part would only apply in that particular situation.
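To see where it does come into play, here's a hedged sketch comparing the full-width letter Ａ (U+FF21) with the plain 'A'; with the _WS collation the two should compare as different, without it they should compare as equal:
SELECT CASE WHEN N'Ａ' = N'A' COLLATE Latin1_General_100_CS_AS    THEN 'equal' ELSE 'different' END AS width_insensitive,
       CASE WHEN N'Ａ' = N'A' COLLATE Latin1_General_100_CS_AS_WS THEN 'equal' ELSE 'different' END AS width_sensitive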

How can I make SQL Server return FALSE for comparing varchars with and without trailing spaces?

If I deliberately store trailing spaces in a VARCHAR column, how can I force SQL Server to see the data as mismatch?
SELECT 'foo' WHERE 'bar' = 'bar '
I have tried:
SELECT 'foo' WHERE LEN('bar') = LEN('bar ')
One method I've seen floated is to append a specific character to the end of every string then strip it back out for my presentation... but this seems pretty silly.
Is there a method I've overlooked?
I've noticed that the padding does not apply to leading spaces, so perhaps I could run a function which reverses the character order before the compare... the problem is that this makes the query unSARGable.
From the docs on LEN (Transact-SQL):
Returns the number of characters of the specified string expression, excluding trailing blanks. To return the number of bytes used to represent an expression, use the DATALENGTH function
Also, from the support page on How SQL Server Compares Strings with Trailing Spaces:
SQL Server follows the ANSI/ISO SQL-92 specification on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them.
Update: I deleted my code using LIKE (which does not pad spaces during comparison) and DATALENGTH() since they are not foolproof for comparing strings
This has also been asked in a lot of other places as well for other solutions:
SQL Server 2008 Empty String vs. Space
Is it good practice to trim whitespace (leading and trailing)
Why would SqlServer select statement select rows which match and rows which match and have trailing spaces
you could try something like this:
declare @a varchar(10), @b varchar(10)
set @a = 'foo'
set @b = 'foo '
select @a, @b, DATALENGTH(@a), DATALENGTH(@b)
Sometimes the dumbest solution is the best:
SELECT 'foo' WHERE 'bar' + 'x' = 'bar ' + 'x'
So basically append any character to both strings before making the comparison.
After some searching, the simplest solution I found was in Anthony Bloesch's weblog.
Just add some text (a char is enough) to the end of the data (append):
SELECT 'foo' WHERE 'bar' + 'BOGUS_TXT' = 'bar ' + 'BOGUS_TXT'
Also works for 'WHERE IN'
SELECT <columnA>
FROM <tableA>
WHERE <columnA> + 'BOGUS_TXT' in ( SELECT <columnB> + 'BOGUS_TXT' FROM <tableB> )
The approach I'm planning to use is a normal comparison, which should be index-keyable ("sargable"), supplemented by a DATALENGTH check (because LEN ignores trailing whitespace). It would look like this:
DECLARE @testValue VARCHAR(MAX) = 'x';
SELECT t.Id, t.Value
FROM dbo.MyTable t
WHERE t.Value = @testValue AND DATALENGTH(t.Value) = DATALENGTH(@testValue)
It is up to the query optimizer to decide the order of filters, but it should choose to use an index for the data lookup if that makes sense for the table being tested and then further filter down the remaining result by length with the more expensive scalar operations. However, as another answer stated, it would be better to avoid these scalar operations altogether by using an indexed calculated column. The method presented here might make sense if you have no control over the schema, or if you want to avoid creating the calculated columns, or if creating and maintaining the calculated columns is considered more costly than the worse query performance.
I've only really got two suggestions. One would be to revisit the design that requires you to store trailing spaces - they're always a pain to deal with in SQL.
The second (given your SARG-able comments) would be to add a computed column to the table that stores the length, and add this column to appropriate indexes. That way, at least, the length comparison should be SARG-able.
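A sketch of that computed-column idea (table and column names reuse the hypothetical dbo.MyTable from the earlier answer):
-- Persist the byte length so the trailing-space check no longer needs a per-row scalar call at query time
ALTER TABLE dbo.MyTable
    ADD ValueDataLength AS DATALENGTH(Value) PERSISTED;

CREATE INDEX IX_MyTable_Value_DataLength
    ON dbo.MyTable (Value, ValueDataLength);

DECLARE @testValue varchar(50)
SET @testValue = 'bar'

SELECT Id, Value
FROM dbo.MyTable
WHERE Value = @testValue
  AND ValueDataLength = DATALENGTH(@testValue)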
