SQL Server 2008 Empty String vs. Space - sql-server

I ran into something a little odd this morning and thought I'd submit it for commentary.
Can someone explain why the following SQL query prints 'equal' when run against SQL 2008? The db compatibility level is set to 100.
if '' = ' '
print 'equal'
else
print 'not equal'
And this returns 0:
select (LEN(' '))
It appears to be auto trimming the space. I have no idea if this was the case in previous versions of SQL Server, and I no longer have any around to even test it.
I ran into this because a production query was returning incorrect results. I cannot find this behavior documented anywhere.
Does anyone have any information on this?

varchars and equality are thorny in TSQL. The LEN function says:
Returns the number of characters, rather than the number of bytes, of the given string expression, excluding trailing blanks.
You need to use DATALENGTH to get a true byte count of the data in question. If you have unicode data, note that the value you get in this situation will not be the same as the length of the text.
print(DATALENGTH(' ')) --1
print(LEN(' ')) --0
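To illustrate the Unicode caveat, here is a minimal sketch (the literals are only illustrations) comparing the byte count of a varchar and an nvarchar single space:

```sql
-- DATALENGTH counts bytes; nvarchar stores 2 bytes per BMP character.
SELECT DATALENGTH(' ')  AS varchar_bytes,   -- 1
       DATALENGTH(N' ') AS nvarchar_bytes,  -- 2
       LEN(N' ')        AS len_result;      -- 0, the trailing blank is ignored
```

So for nvarchar data, DATALENGTH reports twice the number of such characters.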
When it comes to equality of expressions, the two strings are compared for equality like this:
Get Shorter string
Pad with blanks until length equals that of longer string
Compare the two
It's the middle step that is causing unexpected results - after that step, you are effectively comparing whitespace against whitespace - hence they are seen to be equal.
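The padding steps above can be seen in a small sketch (the literals are made up for illustration): trailing blanks disappear in the comparison, leading blanks do not, and adding a DATALENGTH check restores the distinction.

```sql
-- = pads the shorter string on the right, so only trailing blanks are ignored.
SELECT CASE WHEN 'abc' = 'abc   ' THEN 'eq' ELSE 'ne' END AS trailing_blanks, -- eq
       CASE WHEN 'abc' = '  abc'  THEN 'eq' ELSE 'ne' END AS leading_blanks,  -- ne
       CASE WHEN 'abc' = 'abc   '
             AND DATALENGTH('abc') = DATALENGTH('abc   ')
            THEN 'eq' ELSE 'ne' END AS with_datalength_check;                 -- ne
```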
LIKE behaves better than = in the "blanks" situation because it doesn't perform blank-padding on the pattern you were trying to match:
if '' = ' '
print 'eq'
else
print 'ne'
Will give eq while:
if '' LIKE ' '
print 'eq'
else
print 'ne'
Will give ne
Careful with LIKE though: it is not symmetrical: it treats trailing whitespace as significant in the pattern (RHS) but not the match expression (LHS). The following is taken from here:
declare @Space nvarchar(10)
declare @Space2 nvarchar(10)
set @Space = ''
set @Space2 = ' '
if @Space like @Space2
print '@Space Like @Space2'
else
print '@Space Not Like @Space2'
if @Space2 like @Space
print '@Space2 Like @Space'
else
print '@Space2 Not Like @Space'
@Space Not Like @Space2
@Space2 Like @Space

The = operator in T-SQL is not so much "equals" as it is "are the same word/phrase, according to the collation of the expression's context," and LEN is "the number of characters in the word/phrase." No collations treat trailing blanks as part of the word/phrase preceding them (though they do treat leading blanks as part of the string they precede).
If you need to distinguish 'this' from 'this ', you shouldn't use the "are the same word or phrase" operator because 'this' and 'this ' are the same word.
Contributing to the way = works is the idea that the string-equality operator should depend on its arguments' contents and on the collation context of the expression, but it shouldn't depend on the types of the arguments, if they are both string types.
The natural language concept of "these are the same word" isn't typically precise enough to be able to be captured by a mathematical operator like =, and there's no concept of string type in natural language. Context (i.e., collation) matters (and exists in natural language) and is part of the story, and additional properties (some that seem quirky) are part of the definition of = in order to make it well-defined in the unnatural world of data.
On the type issue, you wouldn't want words to change when they are stored in different string types. For example, the types VARCHAR(10), CHAR(10), and CHAR(3) can all hold representations of the word 'cat', and ? = 'cat' should let us decide if a value of any of these types holds the word 'cat' (with issues of case and accent determined by the collation).
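A minimal sketch of the type point, assuming three hypothetical variables of the types just mentioned: all three compare equal as the word 'cat', even though their stored byte lengths differ.

```sql
-- Hypothetical variables; DECLARE with an initializer needs SQL Server 2008 or later.
DECLARE @v varchar(10) = 'cat', @c10 char(10) = 'cat', @c3 char(3) = 'cat';

SELECT CASE WHEN @v = @c10 AND @c10 = @c3
            THEN 'same word' ELSE 'different' END AS comparison, -- same word
       DATALENGTH(@v)   AS v_bytes,    -- 3
       DATALENGTH(@c10) AS c10_bytes,  -- 10, char(10) is blank-padded
       DATALENGTH(@c3)  AS c3_bytes;   -- 3
```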
Response to JohnFx's comment:
See Using char and varchar Data in Books Online. Quoting from that page, emphasis mine:
Each char and varchar data value has a collation. Collations define
attributes such as the bit patterns used to represent each character,
comparison rules, and sensitivity to case or accenting.
I agree it could be easier to find, but it's documented.
Worth noting, too, is that SQL's semantics, where = has to do with the real-world data and the context of the comparison (as opposed to something about bits stored on the computer) has been part of SQL for a long time. The premise of RDBMSs and SQL is the faithful representation of real-world data, hence its support for collations many years before similar ideas (such as CultureInfo) entered the realm of Algol-like languages. The premise of those languages (at least until very recently) was problem-solving in engineering, not management of business data. (Recently, the use of similar languages in non-engineering applications like search is making some inroads, but Java, C#, and so on are still struggling with their non-businessy roots.)
In my opinion, it's not fair to criticize SQL for being different from "most programming languages." SQL was designed to support a framework for business data modeling that's very different from engineering, so the language is different (and better for its goal).
Heck, when SQL was first specified, some languages didn't have any built-in string type. And in some languages still, the equals operator between strings doesn't compare character data at all, but compares references! It wouldn't surprise me if in another decade or two, the idea that == is culture-dependent becomes the norm.

I found this blog article which describes the behavior and explains why.
The SQL standard requires that string
comparisons, effectively, pad the
shorter string with space characters.
This leads to the surprising result
that N'' = N' ' (the empty string
equals a string of one or more space
characters) and more generally any
string equals another string if they
differ only by trailing spaces. This
can be a problem in some contexts.
More information is also available in MSKB316626.

There was a similar question a while ago where I looked into a similar problem here
Instead of LEN(' '), use DATALENGTH(' ') - that gives you the correct value.
The solutions were to use a LIKE clause as explained in my answer in there, and/or include a 2nd condition in the WHERE clause to check DATALENGTH too.
Have a read of that question and links in there.

To compare a value to a literal space, you may also use this technique as an alternative to the LIKE statement:
IF ASCII('') = 32 PRINT 'equal' ELSE PRINT 'not equal'

Sometimes one has to deal with spaces in data, with or without any other characters, even though the idea of using Null is better - but not always usable.
I did run into the described situation and solved it this way:
... where ('>' + @space + '<') <> ('>' + @space2 + '<')
Of course you wouldn't do that for large amount of data but it works quick and easy for some hundred lines ...
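As a self-contained sketch of this delimiter trick (the variable names and values are only illustrations), wrapping both values in markers turns the trailing blank into an interior character, which = then treats as significant:

```sql
DECLARE @space varchar(10) = '', @space2 varchar(10) = ' ';

-- Plain comparison: the shorter string is padded, so the two look equal.
IF @space = @space2 PRINT 'equal' ELSE PRINT 'not equal';   -- equal

-- Delimited comparison: '><' versus '> <' differ at the second character.
IF ('>' + @space + '<') <> ('>' + @space2 + '<')
    PRINT 'not equal'                                       -- not equal
ELSE
    PRINT 'equal';
```

Because the expression wraps the column value, an index on that column probably cannot be used for a seek, which matches the caveat about larger data sets.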

As the SQL-92 standard, section 8.2 <comparison predicate>, says:
If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of
itself that has been extended to the length of the longer
string by concatenation on the right of one or more pad
characters, where the pad character is chosen based on CS. If
CS has the NO PAD attribute, then the pad character is an
implementation-dependent character different from any
character in the character set of X and Y that collates less
than any string under CS. Otherwise, the pad character is a
<space>.

How to distinguish records in a select on char/varchar fields in SQL Server:
Example:
declare @mayvar as varchar(10)
set @mayvar = 'data '
select mykey, myfield from mytable where myfield = @mayvar
Expected:
mykey (int) | myfield (varchar10)
1 | 'data '
Obtained:
mykey | myfield
1 | 'data'
2 | 'data '
Even if I write
select mykey, myfield from mytable where myfield = 'data' (without the final blank)
I get the same results.
How did I solve it? Like this:
select mykey, myfield
from mytable
where myfield = @mayvar
and DATALENGTH(isnull(myfield,'')) = DATALENGTH(@mayvar)
If there is an index on myfield, it will still be used in either case.
I hope this helps.

Another way is to transform the strings so that the space becomes significant.
E.g. replace the space with a known character such as _:
if REPLACE('hello',' ','_') = REPLACE('hello ',' ','_')
print 'equal'
else
print 'not equal'
returns: not equal
Not ideal, and probably slow, but it is another way forward when a fix is needed quickly.

Related

Invalid length parameter passed to the SUBSTRING (when using charindex - only for certain char)

I have a database where the majority of testid have information in parentheses which I want to extract. So I've written (simplified version)
select
case when wt.testid like '%(%)%'
then substring(wt.testid,
charindex('(',wt.testid)+1,
(charindex(')',wt.testid) - charindex('(',wt.testid) - 1)
)
which is working just fine, but not all of these tests have this format, some have this information after an underscore and before a space,
so I added this extra part
else substring(wt.testid,
charindex('_',wt.testid)+1,
(charindex(' ',wt.testid) - charindex('_',wt.testid) -1)
)
but for some reason this causes 'Invalid length parameter passed to the SUBSTRING function.'
If I remove the -1 then the query works, which suggests that the -1 is causing the length parameter to be negative, but all of the results of the query have a space and an underscore, and when I pass
(charindex(' ',wt.testid) - charindex('_',wt.testid) -1)
into my select I always get positive values, so I don't know what's going on.
Even if you have filters that mean that only results containing parentheses or the underscore and space format are returned in your result set, SQL Server will not guarantee not to evaluate some expressions eagerly and generate errors for rows which do not actually contribute to the final result.
You can usually cope with this by making judicious use of CASE expressions to retest facts that you believe are already known:
select
case when wt.testid like '%(%)%'
then substring(wt.testid,
charindex('(',wt.testid)+1,
(charindex(')',wt.testid) - charindex('(',wt.testid) - 1)
)
when wt.testid like '%!_% %' escape '!'
then substring(wt.testid,
charindex('_',wt.testid)+1,
(charindex(' ',wt.testid) - charindex('_',wt.testid) -1)
)
end
See SQL Server should not raise illogical errors if you want to share in the despair that MS seem intent on never fixing this issue.

T-SQL Regex for social security number (SQL Server 2008 R2)

I need to find invalid social security numbers in a varchar field in a SQL Server 2008 database table. (Valid SSNs are defined as being in the format ###-##-####. It doesn't matter what the numbers are, as long as they are in that "3-digit dash 2-digit dash 4-digit" pattern.)
I do have a working regex:
SELECT *
FROM mytable
WHERE ssn NOT LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
That does find the invalid SSNs in the column, but I know (okay - I'm pretty sure) that there is a way to shorten that to indicate that the previous pattern can have x iterations.
I thought this would work:
'[0-9]{3}-[0-9]{2}-[0-9]{4}'
But it doesn't.
Is there a shorter regex than the one above in the select, or not? Or perhaps there is, but T-SQL/SQL Server 2008 doesn't support it!?
If you plan to get a shorter variant of your LIKE expression, then the answer is no.
In T-SQL, you can only use the following wildcards in the pattern:
%
Any string of zero or more characters.
WHERE title LIKE '%computer%' finds all book titles with the word computer anywhere in the book title.
_ (underscore)
Any single character.
WHERE au_fname LIKE '_ean' finds all four-letter first names that end with ean (Dean, Sean, and so on).
[ ]
Any single character within the specified range ([a-f]) or set ([abcdef]).
WHERE au_lname LIKE '[C-P]arsen' finds author last names ending with arsen and starting with any single character between C and P, for example Carsen, Larsen, Karsen, and so on. In range searches, the characters included in the range may vary depending on the sorting rules of the collation.
[^]
Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
So, your LIKE statement is already the shortest possible expression. No limiting quantifiers can be used (those like {min,max}), nor shorthand classes like \d.
If you were using MySQL, you could use a richer set of regex utilities, but that is not the case here.
I suggest using another approach, like this:
-- Use `REPLICATE` if you really want to use a number to repeat
Declare @rgx nvarchar(max) = REPLICATE('#', 3) + '-' +
REPLICATE('#', 2) + '-' +
REPLICATE('#', 4);
-- or use your simple format string instead (an alternative to the declaration above):
-- Declare @rgx nvarchar(max) = '###-##-####';
-- then use this to get your final `LIKE` string.
Set @rgx = REPLACE(@rgx, '#', '[0-9]');
You can also use a placeholder such as '_' for letters and then replace it with [A-Z], and so on.
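Putting the pattern-building idea to work might look like the sketch below. The table variable @t and its sample rows are hypothetical, standing in for mytable from the question.

```sql
-- Build the LIKE pattern from the readable format string.
DECLARE @rgx nvarchar(max) = REPLACE('###-##-####', '#', '[0-9]');

-- Hypothetical sample data standing in for mytable.
DECLARE @t TABLE (ssn varchar(20));
INSERT INTO @t (ssn) VALUES ('123-45-6789'), ('123456789'), ('12-345-6789');

-- Returns the rows that do not match the ###-##-#### shape:
-- 123456789 and 12-345-6789
SELECT ssn FROM @t WHERE ssn NOT LIKE @rgx;
```

The expanded pattern is the same one the question already uses; only the way it is written down changes.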

Unable to return query Thai data

I have a table with columns that contain both thai and english text data. NVARCHAR(255).
In SSMS I can query the table and return all the rows easy enough. But if I then query specifically for one of the Thai results it returns no rows.
SELECT TOP 1000 [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
Returns
Province District SubDistrict Branch
อุตรดิตถ์ ลับแล ศรีพนมมาศ Northern
Bangkok Khlong Toei Khlong Tan SSS1
But this query:
SELECT [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
where [Province] LIKE 'อุตรดิตถ์'
Returns no rows.
What do I need to do to get the expected results?
The collation set is Latin1_General_CI_AS.
The data is displayed and inserted with no errors just can't search.
Two problems:
The string being passed into the LIKE clause is VARCHAR due to not being prefixed with a capital "N". For example:
SELECT 'อุตรดิตถ์' AS [VARCHAR], N'อุตรดิตถ์' AS [NVARCHAR]
-- ????????? อุตรดิตถ์
What is happening here is that when SQL Server is parsing the query batch, it needs to determine the exact type and value of all literals / constants. So it figures out that 12 is an INT and 12.0 is a NUMERIC, etc. It knows that N'ดิ' is NVARCHAR, which is an all-inclusive character set, so it takes the value as is. BUT, as noted before, 'ดิ' is VARCHAR, which is an 8-bit encoding, which means that the character set is controlled by a Code Page. For string literals and variables / parameters, the Code Page used for VARCHAR data is the Database's default Collation. If there are characters in the string that are not available on the Code Page used by the Database's default Collation, they are either converted to a "best fit" mapping, if such a mapping exists, else they become the default replacement character: ?.
Technically speaking, since the Database's default Collation controls string literals (and variables), and since there is a Code Page for "Thai" (available in Windows Collations), then it would be possible to have a VARCHAR string containing Thai characters (meaning: 'ดิ', without the "N" prefix, would work). But that would require changing the Database's default Collation, and that is A LOT more work than simply prefixing the string literal with "N".
For an in-depth look at this behavior, please see my two-part series:
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part A of 2: “Duck”)
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part B of 2: “Rabbit”)
You need to add the wildcard characters to both ends:
N'%อุตรดิตถ์%'
The end result will look like:
WHERE [Province] LIKE N'%อุตรดิตถ์%'
EDIT:
I just edited the question to format the "results" to be more readable. It now appears that the following might also work (since no wildcards are being used in the LIKE predicate in the question):
WHERE [Province] = N'อุตรดิตถ์'
EDIT 2:
A string (i.e. something inside of single-quotes) is VARCHAR if there is no "N" prefixed to the string literal. It doesn't matter what the destination datatype is (e.g. an NVARCHAR(255) column). The issue here is the datatype of the source data, and that source is a string literal. And unlike a string in .NET, SQL Server handles 'string' as an 8-bit encoding (VARCHAR; ASCII values 0 - 127 same across all Code Pages, Extended ASCII values 128 - 255 determined by the Code Page, and potentially 2-byte sequences for Double-Byte Character Sets) and N'string' as UTF-16 Little Endian (NVARCHAR; Unicode character set, 2-byte sequences for BMP characters 0 - 65535, two 2-byte sequences for Code Points above 65535). Using 'string' is the same as passing in a VARCHAR variable. For example:
DECLARE @ASCII VARCHAR(20);
SET @ASCII = N'อุตรดิตถ์';
SELECT @ASCII AS [ImplicitlyConverted]
-- ?????????
Could be a number of things!
First off, print out the value of the column and your query string in hex.
SELECT convert(varbinary(20), Province) as [stored], convert(varbinary(20), 'อุตรดิตถ์') as [query] from allDistricsBranches;
This should give you some insight to the problem. I think the most likely cause is the ั, ิ, characters being typed in the wrong sequence. They are displayed as part of the main letter but are stored internally as separate characters.

String manipulation in a column in an Oracle table

One of my tables' column contains names, for example as "Obama, Barack" (with double quotes). I was wondering if we can do something to make it appear as Barack Obama in the tables. I think we can do it with declaring a variable but just could not manage to find a solution.
And yes as this table contains the multiple transactions of the same person we also end up with having multiple rows of "Obama, Barack"... a data warehouse concept (fact tables).
What @Ben has said is correct. Having two columns, one for the first name and one for the last name, is the right design.
However if you wish to update the entire database as it is you could do...
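/*This will swap the order round (T-SQL syntax; Oracle would use SUBSTR, INSTR and || instead)*/
UPDATE TableName SET NameColumn = '"' + SUBSTRING(NameColumn, CHARINDEX(',', NameColumn) + 2, LEN(NameColumn) - CHARINDEX(',', NameColumn) - 2) + ' ' + SUBSTRING(NameColumn, 2, CHARINDEX(',', NameColumn) - 2) + '"'
WHERE NameColumn LIKE '"%, %"'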
/*This will remove the quotes*/
UPDATE TableName SET NameColumn = REPLACE(NameColumn, '"', '')
Edit:- but as I can't see your data you may have to edit it slightly. But the theory is correct. See here http://www.technoreader.com/SQL-Server-String-Functions.aspx
From the question I assume you want to:
Remove the quotes
Remove the comma
Swap the names
So Regexp_replace is probably your best bet
UPDATE tablename
SET column_name = REGEXP_REPLACE( column_name, '^"(\w+), (\w+)"$', '\2 \1' )
So regexp_replace changes the column value only if it matches the pattern exactly. The parts of the expression are:
^" means it must start with a double quote
(\w+) means immediately followed by a string of 1 or more alphanumeric characters. This string is then saved as the variable \1 because it's the first set of ()
, means immediately followed by a comma and a space
(\w+) means immediately followed by a string of 1 or more alphanumeric characters. This string is then saved as the variable \2 because it's the second set of ()
"$ means immediately followed by a double quote, which is the end of the string
\2 \1 is the replacement string: the second saved string followed by a space followed by the first saved string
So anything which does not exactly match these conditions will not be replaced. If you have any leading or trailing spaces, or more than one space after the comma, or any of many other deviations, the text will not be replaced.
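A quick way to try the strict pattern without touching the table is to run it against literals, as in this sketch (Oracle syntax; the sample strings are made up):

```sql
SELECT REGEXP_REPLACE('"Obama, Barack"',  '^"(\w+), (\w+)"$', '\2 \1') AS swapped,   -- Barack Obama
       REGEXP_REPLACE('"Obama,  Barack"', '^"(\w+), (\w+)"$', '\2 \1') AS unchanged  -- "Obama,  Barack"
FROM dual;
```

The second value has two spaces after the comma, so the pattern does not match and the input comes back untouched.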
A much more flexible (maybe too flexible) option could be:
UPDATE tablename
SET column_name = REGEXP_REPLACE( column_name, '^\W*(\w+)\W+(\w+)\W*$', '\2 \1' )
This is similar but effectively makes the quotes and the comma optional, and deals with any other leading or trailing punctuation or whitespace.
^\W* means must start with zero or more non-alphanumerics
(\w+)\W+(\w+) means two alphanumeric strings separated by one or more non-alphanumerics. The two strings are saved as described above
\W*$ means must then end with zero or more non-alphanumerics
More info on regexp in oracle is here
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/ap_posix.htm
http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_regexp.htm

How to get number of chars in string in Transact SQL, the "other way"

We faced a very strange issue (really strange for such a mature product):
how to get the number of characters in a Unicode string using Transact-SQL statements.
The key problem is that the len() TSQL function returns the number of chars, excluding trailing blanks. The other variant is to use datalength (which returns the number of bytes) and divide by 2 to get the number of Unicode chars. But Unicode chars can be surrogate pairs, so that won't work either.
We have 2 variants of a solution: the first is to use len(replace()) and the second is to add a single symbol and then subtract 1 from the result. But IMO both variants are rather ugly.
declare @txt nvarchar(10)
set @txt = 'stack '
select @txt as variable,
len(@txt) as lenBehaviour,
DATALENGTH(@txt)/2 as datalengthBehaviour,
len(replace(@txt,' ','O')) as ReplaceBehaviour,
len(@txt+'.')-1 as addAndMinusBehaviour
Any other ideas how to count chars in string with trailing spaces?
I can't leave a comment so I will have to leave an answer (or shut up).
My vote would be for the addAndMinusBehaviour.
I haven't got a good third alternative. There may be some obscure whitespace rules to fiddle with in the options / SET / Collation assignment, but I don't know more detail off the top of my head.
But really, addAndMinusBehaviour is probably the easiest to implement, the fastest to execute and, if you document it, fairly maintainable as well.
CREATE FUNCTION [dbo].[ufn_CountChar] ( @pInput VARCHAR(1000), @pSearchChar CHAR(1) )
RETURNS INT
BEGIN
RETURN (LEN(@pInput) - LEN(REPLACE(@pInput, @pSearchChar, '')))
END
GO
My understanding is that DATALENGTH(#txt)/2 should always give you the number of characters. SQL Server stores Unicode characters in UCS-2 which does not support surrogate pairs.
http://msdn.microsoft.com/en-us/library/ms186939.aspx
http://en.wikipedia.org/wiki/UCS2

Resources