Regex string with 2+ different numbers and some optional characters in Snowflake syntax - snowflake-cloud-data-platform

I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)]
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!

Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is the '+' is a quantifier, which is not quantifiable with {3,} but somehow doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers, numbers are characters but I'm not sure if you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences.
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;

Assuming the digits need to be contiguous, you can use a javascript UDF to find the number in a string with with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
Hope that's helpful!

Related

Snowflake REGEX to identify if column contains any digit

I currently have select REGEXP_LIKE(col, '[0-9]+') which seems to return True only if all the characters in the string are numeric.
For example, it returns True for 12345 but False for something like 100 Apple St.
What is the necessary regex pattern to return True in both examples above?
To check if a column contains any digit, you can modify your current pattern to use the .* character to match any number of characters before or after the digit(s):
SELECT REGEXP_LIKE(col, '.*[0-9].*')

Why does the EXCEPT clause trim whitespace at the end of text?

I read through the documentation for the SqlServer EXCEPT operator and I see no mention of explicit trimming of white space at the end of a string. However, when running:
SELECT 'Test'
EXCEPT
SELECT 'Test '
no results are returned. Can anyone explain this behavior or how to avoid it when using EXCEPT?
ANSI SQL-92 requires strings to be the same length before comparing and the pad character is a space.
See https://support.microsoft.com/en-us/help/316626/inf-how-sql-server-compares-strings-with-trailing-spaces for more information
In the ANSI standard (accessed here section 8.2 )
3) The comparison of two character strings is determined as follows:
a) If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of
itself that has been extended to the length of the longer
string by concatenation on the right of one or more pad char-
acters, where the pad character is chosen based on CS. If
CS has the NO PAD attribute, then the pad character is an
implementation-dependent character different from any char-
acter in the character set of X and Y that collates less
than any string under CS. Otherwise, the pad character is a
<space>.
b) The result of the comparison of X and Y is given by the col-
lating sequence CS.
c) Depending on the collating sequence, two strings may com-
pare as equal even if they are of different lengths or con-
tain different sequences of characters. When the operations
MAX, MIN, DISTINCT, references to a grouping column, and the
UNION, EXCEPT, and INTERSECT operators refer to character
strings, the specific value selected by these operations from
a set of such equal values is implementation-dependent.
If this behaviour must be avoided, you can reverse the columns as part of your EXCEPT:
SELECT 'TEST', REVERSE('TEST')
EXCEPT
SELECT 'TEST ', REVERSE('TEST ')
which gives the expected result, though is quite annoying especially if you're dealing with multiple columns.
The alternative would be to find a collating sequence with an alternate pad character or a no pad option set, though this seems to not exist in t-sql after a quick google.
Alternatively, you could terminate each column with a character and then substring it out in the end:
SELECT SUBSTRING(col,1,LEN(col) -1) FROM
(
SELECT 'TEST' + '^' as col
EXCEPT
SELECT 'TEST ' + '^'
) results

Regular expression unexpected pattern matching

I am trying to create a syntax parser using C-Bison and Flex. In Flex I have a regular expression which matches integers based on the following:
Must start with any digit in range 1-9 and followed by any number of digits in range 0-9. (ex. Correct: 1,12,11024 | Incorrect: 012)
Can be signed (ex. +2,-5)
The number 0 must not be followed by any digit (0-9) and must not signed. (ex. Correct: 0 | Incorrect: 012,+0,-0)
Here is the regex I have created to perform the matching:
[^+-]0[^0-9]|[+-]?[1-9][0-9]*
Here is the expression I am testing:
(1 + 1 + 10)
The matches:
1
1
10)
And here is my question, why does it match '10)'?
The reason I used the above expression, instead of the much simpler one,
(0|[+-]?[1-9][0-9]*) is due to inability of the parser to recognise incorrect expressions such as 012.
The problem seems to occur only when before the ')' precedes the digit '0'. However if the '0' is preceded by two or more digits (ex. 100), then the ')' is not matched.
I know for a fact if I remove [^0-9] from the regex it doesn't match the ')'.
It matches 10( because 1 matches [^+-], 0 matches 0 and ( matches [^0-9].
The reason I used the above expression, instead of the much simpler one, (0|[+-]?[1-9][0-9]*) is due to inability of the parser to recognise incorrect expressions such as 012.
How so? Using the above regex, 012 would be recognized as two tokens: 0 and 12. Would this not cause an error in your parser?
Admittedly, this would not produce a very good error message, so a better approach might be to just use [0-9]+ as the regex and then use the action to check for a leading zero. That way 012 would be a single token and the lexer could produce an error or warning about the leading zero (I'm assuming here that you actually want to disallow leading zeros - not use them for octal literals).
Instead of a check in the action, you could also keep your regex and then add another one for integers with a leading zero (like 0[0-9]+ { warn("Leading zero"); return INT; }), but I'd go with the check in the action since it's an easy check and it keeps the regex short and simple.
PS: If you make - and + part of the integer token, something like 2+3 will be seen as the integer 2, followed by the integer +3, rather than the integers 2 and 3 with a + token in between. Therefore it is generally a better idea to not make the sign a part of the integer token and instead allow prefix + and - operators in the parser.

Regex for one column that has numbers and letters but not one or the other

I am attempting to search a column that contains alphanumeric ids in it but want to write a query that returns records with letters and numbers but not one or the other.
i.e Acceptable: jjk44kndkfndFF
i.e Not acceptable: 223232323232 or aajnfdskDFdd
So far I have:
where PATINDEX('%[^a-zA-Z0-9 ]%',columnInQuestion)
This returns all alphanumeric records. Any direction appreciated
I think you need three predicates in the WHERE clause:
WHERE (columnInQuestion NOT LIKE '%[^a-zA-Z0-9]%') AND
(PATINDEX('%[a-zA-Z]%', columnInQuestion) <> 0) AND
(PATINDEX('%[0-9]%', columnInQuestion) <> 0)
First predicate (columnInQuestion NOT LIKE '%[^a-zA-Z0-9]%') is true if columnInQuestion contains only alphanumeric characters
Second predicate (PATINDEX('%[a-zA-Z]%', columnInQuestion) <> 0) is true if there is at least one alphabetic character in columnInQuestion
Third predicate (PATINDEX('%[0-9]%', columnInQuestion) <> 0) is true if there is at least one numeric character in columnInQuestion
It can be done with just one regexp:
^[a-zA-Z0-9]*([a-zA-Z][0-9]|[0-9][a-zA-Z])[a-zA-Z0-9]*$
It starts and ends with 0-x legal chars.
And somewhere there is a switch from a letter to a digit or from a digit to a letter.

Lex/Flex - Split the phone number Up?

I am making a program which got to split the phone-number apart, each part has been divided by a hyphen (or spaces, or '( )' or empty).
Exp: Input: 0xx-xxxx-xxxx or 0xxxxxxxxxx or (0xx)xxxx-xxxx
Output: code 1: 0xx
code 2: xxxx
code 3: xxxx
But my problem is: sometime "Code 1" is just 0x -> so "Code 2" must be xxxxx (1st part always have hyphen or a parenthesis when 2 digit long)
Anyone can give me a hand, It would be grateful.
According to your comments, the following regex will extract the information you need
^\(?(0\d{1,2})\)?[- ]?(\d{4,5})[- ]?(\d{4})$
Break down:
^\(?(0\d{1,2})\)? matches 0x, 0xx, (0xx) and (0x) at he beggining of the string
[- ]? as parenthesis can only be used for the first group, the only valid separators left are space and the hyphen. ? means 0 or 1 time.
(\d{4,5}) will match the second group. As the length of the 3rd group is fixed (4 digits), the regex will automatically calculate the length of the Group1 and 2.
(\d{4})$ matches the 4 digits at the end of the number.
See it in action
You can the extract data from capture group 1,2 and 3
Note: As mentionned in the comments of the OP, this only extracts data from correctly formed numbers. It will match some ill-formed numbers.

Resources