Underscore in Where clause yields unexpected result, why? [duplicate] - sql-server

I want to find all tables starting with TB_, so I wrote the following script:
select *
from INFORMATION_SCHEMA.TABLES
where TABLE_NAME like 'TB_%'
To my surprise, I got the following result:
TB103_xxx
TB037_bbb
TB104_ccc
I'm curious why?

In a LIKE pattern, the underscore matches any single character. See
MSDN - LIKE (Transact-SQL)
% - Any string of zero or more characters.
_ - Any single character. _a will match aa, ba etc.
[ ] - Any single character within the specified range ([a-f]) or set ([abcdef]).
[^] - Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
You could use [_] to match a literal underscore: LIKE 'TB[_]%'
Or you could use LIKE 'TB\_%' ESCAPE '\'. (thanks to Jeroen Mostert)
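One way to see why the two patterns behave differently is to model LIKE outside the database. The following is a minimal Python sketch, not SQL Server's implementation: the helper names like_to_regex and like are made up for illustration, collation and case-insensitivity are ignored, and only %, _, non-empty [ ] and [^ ] sets and an optional escape character are handled.

```python
import re


def like_to_regex(pattern, escape=None):
    """Translate a LIKE-style pattern into an equivalent Python regex string."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if escape is not None and ch == escape and i + 1 < len(pattern):
            # The character after the escape character is taken literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
        elif ch == "%":
            parts.append(".*")  # any string of zero or more characters
            i += 1
        elif ch == "_":
            parts.append(".")  # exactly one character
            i += 1
        elif ch == "[" and "]" in pattern[i + 1:]:
            end = pattern.index("]", i + 1)
            body = pattern[i + 1:end]
            negate = body.startswith("^")
            if negate:
                body = body[1:]
            # Keep "-" so ranges such as a-f still work; escape everything else.
            inner = "".join(c if c == "-" else re.escape(c) for c in body)
            parts.append("[" + ("^" if negate else "") + inner + "]")
            i = end + 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return "".join(parts)


def like(value, pattern, escape=None):
    """Return True when the whole value matches the LIKE-style pattern."""
    regex = like_to_regex(pattern, escape)
    return re.fullmatch(regex, value, re.DOTALL) is not None


print(like("TB103_xxx", "TB_%"))                      # True: "_" matched "1"
print(like("TB103_xxx", "TB[_]%"))                    # False: third character must be "_"
print(like("TB_orders", "TB[_]%"))                    # True
print(like("TB_orders", "TB\\_%", escape="\\"))       # True: escaped "_" is literal
```

The table name TB_orders is a hypothetical example; the other names come from the question.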

This happens because you used the underscore (_) symbol, which matches any single character in a LIKE pattern.
See SQL LIKE Operator.
It is better to use WHERE TABLE_NAME LIKE 'TB[_]%' or WHERE TABLE_NAME LIKE 'TB\_%' ESCAPE '\'
% - The percent sign represents zero, one, or multiple characters.
_ - The underscore represents a single character.
[] - Any single character within the specified range ([a-f]) or set ([abcdef]).
[^] - Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
Here are some examples:
WHERE CustomerName LIKE 'a%' Finds any values that start with "a"
WHERE CustomerName LIKE '%a' Finds any values that end with "a"
WHERE CustomerName LIKE '%or%' Finds any values that have "or" in any position
WHERE CustomerName LIKE '_r%' Finds any values that have "r" in the second position
WHERE CustomerName LIKE 'a_%' Finds any values that start with "a" and are at least 2 characters in length
WHERE CustomerName LIKE 'a%o' Finds any values that start with "a" and end with "o"
WHERE CustomerName LIKE '[a-e]arsen' Finds any values that consist of a single character between "a" and "e" followed by "arsen"
WHERE CustomerName LIKE '[^a-e]arsen' Finds any values that consist of a single character not between "a" and "e" followed by "arsen"

Related

Regex string with 2+ different numbers and some optional characters in Snowflake syntax

I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is that '+' is itself a quantifier, so it should not be quantifiable again with {3,}, yet somehow this doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers. Numbers are characters, but I'm not sure whether you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences in regular expression patterns.
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
Assuming the digits need to be contiguous, you can use a JavaScript UDF to find the number in a string with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
Hope that's helpful!
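For comparison, the three requirements from the question can be checked in a few lines outside the database. This is a minimal Python sketch, not Snowflake syntax: the function name is_valid is made up for illustration, and it assumes the two different digits do not have to be adjacent.

```python
import re

# At least three characters, all drawn from digits, letters, underscore, dash.
ALLOWED = re.compile(r"[A-Za-z0-9_-]{3,}")


def is_valid(value):
    """Return True when value meets all three requirements from the question."""
    if value is None or ALLOWED.fullmatch(value) is None:
        return False
    # Collect the distinct digits; two or more means "two different numbers".
    distinct_digits = set(re.findall(r"[0-9]", value))
    return len(distinct_digits) >= 2


for candidate in ["123", "111", "12", "abc123def567ghi1111_123", "123 456"]:
    print(candidate, is_valid(candidate))
```

Splitting the check into "allowed characters and length" plus "count distinct digits" avoids the need for a backreference in the pattern.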

Why does the EXCEPT clause trim whitespace at the end of text?

I read through the documentation for the SQL Server EXCEPT operator and I see no mention of explicit trimming of white space at the end of a string. However, when running:
SELECT 'Test'
EXCEPT
SELECT 'Test '
no results are returned. Can anyone explain this behavior or how to avoid it when using EXCEPT?
ANSI SQL-92 requires strings to be padded to the same length before they are compared, and the pad character is a space.
See https://support.microsoft.com/en-us/help/316626/inf-how-sql-server-compares-strings-with-trailing-spaces for more information
From the ANSI standard (section 8.2):
3) The comparison of two character strings is determined as follows:
a) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD attribute, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a <space>.
b) The result of the comparison of X and Y is given by the collating sequence CS.
c) Depending on the collating sequence, two strings may compare as equal even if they are of different lengths or contain different sequences of characters. When the operations MAX, MIN, DISTINCT, references to a grouping column, and the UNION, EXCEPT, and INTERSECT operators refer to character strings, the specific value selected by these operations from a set of such equal values is implementation-dependent.
If this behaviour must be avoided, you can add the reversed column to your EXCEPT. Reversing turns the trailing space into a leading one, which padding on the right does not hide:
SELECT 'TEST', REVERSE('TEST')
EXCEPT
SELECT 'TEST ', REVERSE('TEST ')
which gives the expected result, though is quite annoying especially if you're dealing with multiple columns.
The alternative would be to find a collating sequence with an alternate pad character or a no pad option set, though after a quick search this does not seem to exist in T-SQL.
Alternatively, you could terminate each column with a character and then substring it out in the end:
SELECT SUBSTRING(col,1,LEN(col) -1) FROM
(
SELECT 'TEST' + '^' as col
EXCEPT
SELECT 'TEST ' + '^'
) results
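The padding rule and both workarounds can be modelled in a few lines. This is a minimal Python sketch of the comparison described in the standard, not of SQL Server itself; the function name pad_space_equal is made up for illustration.

```python
def pad_space_equal(x, y):
    """Compare two strings after padding the shorter one on the right with spaces."""
    width = max(len(x), len(y))
    return x.ljust(width) == y.ljust(width)


# Plain comparison: the trailing space is hidden by the padding.
print(pad_space_equal("Test", "Test "))              # True

# REVERSE workaround: the space is now leading, so the strings differ.
print(pad_space_equal("Test"[::-1], "Test "[::-1]))  # False

# Terminator workaround: the space is no longer at the end.
print(pad_space_equal("Test" + "^", "Test " + "^"))  # False
```

Because the two rows compare as equal in the first case, EXCEPT removes the row; in the other two cases they compare as different, so the row survives.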

What is the use of ^ in patindex in SQL Server?

When I execute this
select PATINDEX('%[0 ]%', '03/SI/00807/18-19')
I am getting 1.
By using ^ like this:
select PATINDEX('%[^0 ]%', '03/SI/00807/18-19')
I am getting 2.
[^] Allows you to match on any character not in the [^] brackets (for example, [^abc] would match on any character that is not a, b, or c)
[ ] Allows you to match on any character in the [ ] brackets (for example, [abc] would match on a, b, or c characters)
_ Allows you to match on a single character
% Allows you to match any string of any length (including zero length)
[^abcd] means: any one character EXCEPT a,b,c or d
select PATINDEX('%[0 ]%', '03/SI/00807/18-19')
The first character in your string which is (0 or space) is the 0 in the first place, so patindex returns 1.
select PATINDEX('%[^0 ]%', '03/SI/00807/18-19')
The first character in your string which is (neither 0 nor space) is the 3 in the second place, so patindex returns 2.
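The two results can be reproduced with a small model of what PATINDEX does for patterns of the form '%[set]%' and '%[^set]%'. This is a minimal Python sketch under that assumption; the function name first_position is made up for illustration and it does not handle general patterns.

```python
def first_position(text, chars, negate=False):
    """Return the 1-based position of the first character that is in chars
    (or not in chars when negate is True), or 0 when there is none."""
    for position, ch in enumerate(text, start=1):
        if (ch in chars) != negate:
            return position
    return 0


value = "03/SI/00807/18-19"
print(first_position(value, "0 "))               # 1, like PATINDEX('%[0 ]%', ...)
print(first_position(value, "0 ", negate=True))  # 2, like PATINDEX('%[^0 ]%', ...)
```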

SQL Select statement until a character

I'm looking to extract all the text up until a '\' (backslash).
The substring is required to remove all preceding characters (17 in total) and so I would like to return all after the 17th until it comes across a backslash.
I've tried using charindex but it doesn't seem to stop at the \ it returns characters afterward. My code is as follows
SELECT path, substring(path,17, CHARINDEX('\',Path)+ LEN(Path)) As Data
FROM [Table].[dbo].[Projects]
WHERE Path like '\ENQ%\' AND
Deleted = '0'
Example
The screenshot below shows the basic query and its result, i.e. the whole string
I then use substring to remove the first X characters, as there will always be the same number of preceding characters
But what I'm actually after is (based on the above result) the "Testing 1", "Testing 2" and "Testing ABC" sections
The substring is required to remove all preceding characters (17 in total) and so I would like to return all after the 17th until it comes across a backslash.
select
-- start the search at position 17 so the leading backslash is skipped
substring(path,17,CHARINDEX('\',Path,17)-17)
from
table
To avoid the "Invalid length parameter passed to the LEFT or SUBSTRING function" error, you can use CASE:
select
substring(path,17,
CASE when CHARINDEX('\',Path,17)>0
Then CHARINDEX('\',Path,17)-17
else 0 end -- assumption: return an empty string when no backslash is found
)
from
table
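The position arithmetic is easier to check outside the database. This is a minimal Python sketch of the same idea, assuming 1-based positions as in SUBSTRING and CHARINDEX; the function name text_until_backslash and the sample path are hypothetical, and returning an empty string when no backslash follows is an assumption.

```python
def text_until_backslash(path, start=17):
    """Return the text from 1-based position start up to, but not including,
    the next backslash. Returns "" when no backslash follows (assumption)."""
    begin = start - 1                 # convert to a 0-based index
    end = path.find("\\", begin)      # search only from the start position
    if end == -1:
        return ""
    return path[begin:end]


# Hypothetical path with 16 characters before the part of interest.
sample = "\\ENQ\\2018\\00123\\Testing 1\\"
print(text_until_backslash(sample))   # Testing 1
```

Searching from the start position matters: searching from the beginning would find the leading backslash at position 1 and produce a negative length.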

Sql Server's regex LIKE - behaviour clarification?

Someone asked here how to get only values which are a number :
So , if the table is :
CREATE TABLE #Table(
Col nVARCHAR(50)
)
INSERT INTO #Table SELECT 'ABC'
INSERT INTO #Table SELECT '234.62'
INSERT INTO #Table SELECT '10:10:10:10'
INSERT INTO #Table SELECT 'France'
INSERT INTO #Table SELECT '2'
then - the desired results are :
234.62
2
But when I tested this query :
SELECT * FROM #Table WHERE Col LIKE '%[0-9.]%' --expected to see only 234.62
it showed :
234.62
10:10:10:10
2
Question #1
How come 10:10:10:10 , 2 satisfies the condition ?
Question #2
I saw this answer here which does work
SELECT * FROM #Table WHERE Col NOT LIKE '%[^0-9.]%'
But I don't understand why this works. AFAIU - it selects all values which are not like (not(has number) and not( has dot)) which is ===>(de morgan)===> not like ( has number or has dot)
Can someone please shed light ?
nb I already know that isnumeric can be used also , but it's unsafe (+). also valid wildcards are %,_,[],[^]
Any particular use of [set] within a LIKE expression is a check against one character in the target string.
So, LIKE '%[0-9.]%' says: % matches 0-to-many arbitrary characters, then [0-9.] matches one character in the set 0-9., and then % matches 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character in the set 0-9.". So, 10:10:10:10 can be matched as 0 arbitrary characters, then 1 matches [0-9.], and then 0:10:10:10 matches the final %.
LIKE '%[^0-9.]%' says: % matches 0-to-many arbitrary characters, then [^0-9.] matches one character not in the set 0-9., and then % matches 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character outside of the set 0-9.". So when we apply the NOT to the front of that, we are saying "match any string that doesn't contain at least one character outside of the set 0-9.", or "match strings that only contain characters in the set 0-9.".
Essentially, the double-negative is a way to make an assertion about all characters in the string.
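The difference between "at least one character in the set" and "no character outside the set" can be shown on the question's own rows. This is a minimal Python sketch that models the two predicates directly rather than running LIKE; the function names are made up for illustration.

```python
ALLOWED = set("0123456789.")


def has_allowed_char(value):
    """Model of LIKE '%[0-9.]%': at least one character is in the set."""
    return any(ch in ALLOWED for ch in value)


def has_disallowed_char(value):
    """Model of LIKE '%[^0-9.]%': at least one character is outside the set."""
    return any(ch not in ALLOWED for ch in value)


rows = ["ABC", "234.62", "10:10:10:10", "France", "2"]

# LIKE '%[0-9.]%' keeps every row that contains a digit or a dot somewhere.
print([r for r in rows if has_allowed_char(r)])
# ['234.62', '10:10:10:10', '2']

# NOT LIKE '%[^0-9.]%' keeps only rows made up entirely of digits and dots.
print([r for r in rows if not has_disallowed_char(r)])
# ['234.62', '2']
```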

Resources