Snowflake regexp & split - snowflake-cloud-data-platform

I have column values in which the part I need starts after an underscore (_) and ends before the next underscore (_).
I am trying to extract the value between the two underscores, for example from "xxxx_Whats your number 23345_xxxxx".
I want to discard everything before the first underscore and after the second.
Any help is greatly appreciated.

Use REGEXP_SUBSTR with a capture group ( ), pass the 'e' parameter to return the sub-match, and select the first occurrence. The pattern says: an underscore, then any number of non-underscore characters, then an underscore.
select
column1,
regexp_substr(column1, '_([^_]*)_',1,1,'e')
from values
('xxxx_Whats your number 23345_xxxxx')
gives:
COLUMN1                            | REGEXP_SUBSTR(COLUMN1, '_([^_]*)_',1,1,'E')
-----------------------------------+--------------------------------------------
xxxx_Whats your number 23345_xxxxx | Whats your number 23345
Hmm, you mention discarding everything before and after the underscores, so if you want to keep the underscores themselves you need to move them inside the capture group:
select
column1,
regexp_substr(column1, '_([^_]*)_',1,1,'e') as exclude_underscore,
regexp_substr(column1, '(_[^_]*_)',1,1,'e') as include_underscore
from values
('xxxx_Whats your number 23345_xxxxx'),
('has no first underscore_xxxxx'),
('xxx_has no last underscore'),
('nothing between__the underscores');
COLUMN1                            | EXCLUDE_UNDERSCORE      | INCLUDE_UNDERSCORE
-----------------------------------+-------------------------+--------------------------
xxxx_Whats your number 23345_xxxxx | Whats your number 23345 | _Whats your number 23345_
has no first underscore_xxxxx      | null                    | null
xxx_has no last underscore         | null                    | null
nothing between__the underscores   |                         | __
You might also want at least one character between the underscores, in which case change the * to a + or a {n,}.
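The behaviour of the two patterns can be checked outside Snowflake. Below is a minimal Python sketch, assuming Python's re module treats these simple patterns the same way Snowflake does. The helper name between_underscores and its parameters are hypothetical; min_chars=1 corresponds to changing the * to a +.

```python
import re

def between_underscores(text, include_underscores=False, min_chars=0):
    """Return the first underscore-delimited run of non-underscore characters, or None."""
    # [^_]{n,} means "at least n characters that are not underscores".
    inner = "[^_]{%d,}" % min_chars
    if include_underscores:
        pattern = "(_" + inner + "_)"   # underscores inside the capture group
    else:
        pattern = "_(" + inner + ")_"   # underscores outside the capture group
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(between_underscores("xxxx_Whats your number 23345_xxxxx"))
print(between_underscores("xxxx_Whats your number 23345_xxxxx", include_underscores=True))
```

With min_chars=1, the 'nothing between__the underscores' row no longer matches, because the two underscores are adjacent.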

Related

Is there a way to find values that contain only 0's and a symbol of any length?

I want to find strings of any length that contain only 0's and a symbol such as a /, a . or a -
Examples include 0__0 and 000/00/00000 and .00000
Considering this sample data:
CREATE TABLE dbo.things(thing varchar(255));
INSERT dbo.things(thing) VALUES
('0__0'),('000/00/00000'),('00000'),('0123456');
Try the following, which locates the first position of any character that is NOT a 0, a period, a forward slash, or an underscore. PATINDEX returns 0 if the pattern is not found.
SELECT thing FROM dbo.things
WHERE PATINDEX('%[^0./_]%', thing) = 0;
Results:
thing
0__0
000/00/00000
00000
The opposite:
SELECT thing FROM dbo.things
WHERE PATINDEX('%[^0./_]%', thing) > 0;
Results:
thing
0123456
Example db<>fiddle
I can see a way of doing this, but it won't perform well if you use it as a search criterion.
We use the TRANSLATE function in SQL Server to replace the allowed characters (the symbols, as you call them) with a zero, and then remove the zeroes. If the result is an empty string, there are two possibilities: either the value contained only zeroes and allowed characters, or it was already an empty string.
So, by also checking that the value contains a zero (which rules out empty strings), we can determine whether it matches your criteria.
-- Test scenario
create table #example (something varchar(200) )
insert into #example(something) values
--Example cases from Stack Overflow
('0__0'),('000/00/00000'),('.00000'),
-- With something not allowed (don't know, just put a number)
('1230__0'),('000/04560/00000'),('.00000789'),
-- Just not allowed characters, zero, blank, and NULL
('1234567489'),('0'), (''),(null)
-- Shows the data, with a column to check if it matches your criteria
select *
from #example e
cross apply (
select case when
-- If it *must* have at least a zero
e.something like '%0%' and
-- Eliminates zeroes
replace(
-- Replaces the allowed characters with zero
translate(
e.something
,'_./'
,'000'
)
,'0'
,''
) = ''
then cast(1 as bit)
else cast(0 as bit)
end as doesItMatch
) as criteria(doesItMatch)
I really discourage you from using this as a search criterion.
-- Queries the table over this criteria.
-- This is going to compute over your entire table, so it can get very CPU intensive
select *
from #example e
where
-- If it *must* have at least a zero
e.something like '%0%' and
-- Eliminates zeroes
replace(
-- Replaces the allowed characters with zero
translate(
e.something
,'_./'
,'000'
)
,'0'
,''
) = ''
If you must use this as a search criterion, and it will be a common filter in your application, I suggest you create a new bit column that flags whether the value matches, and index it. That way the extra computation is spread across the inserts and updates, and the search queries won't overload the database.
The code can be seen executing here, on DB Fiddle.
What I got from the question is that the strings must contain both 0 and any combination of the special characters in the string.
If you have SQL Server 2017 and above, you can use translate() to replace multiple characters with a space and compare this with the empty string. Also you can use LIKE to enforce that both a 0 and any combination of the special character(s) appear at least once:
DECLARE @temp TABLE (val varchar(100))
INSERT INTO @temp VALUES
('0__0'), ('000/00/00000'), ('.00000'), ('w0hee/'), ('./')
SELECT *
FROM @temp
WHERE val LIKE '%0%' --must have at least one zero somewhere
AND val LIKE '%[_/.]%' --must have at least one special character somewhere
AND TRANSLATE(val, '0./_', '    ') = '' --zeros and special characters become spaces (one per character), which compares equal to an empty string
Creates output:
val
0__0
000/00/00000
.00000
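The three conditions in that WHERE clause can also be read as plain character checks. Below is a minimal Python sketch of the same logic, not of how SQL Server evaluates it. The names ALLOWED_SYMBOLS and only_zeros_and_symbols are hypothetical, and the symbol set is assumed to be _, . and / as in the query above.

```python
ALLOWED_SYMBOLS = set("_./")  # assumed symbol set, taken from the query above

def only_zeros_and_symbols(value):
    """True when value has at least one 0, at least one symbol, and nothing else."""
    if value is None:
        return False
    has_zero = "0" in value                                    # LIKE '%0%'
    has_symbol = any(ch in ALLOWED_SYMBOLS for ch in value)    # LIKE '%[_/.]%'
    only_allowed = all(ch == "0" or ch in ALLOWED_SYMBOLS for ch in value)
    return has_zero and has_symbol and only_allowed

for val in ["0__0", "000/00/00000", ".00000", "w0hee/", "./"]:
    print(val, only_zeros_and_symbols(val))
```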

Regex string with 2+ different numbers and some optional characters in Snowflake syntax

I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is that '+' is itself a quantifier, so it should not be quantifiable with {3,}, yet somehow this doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers. Numbers are characters, but I'm not sure whether you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences in regular expression patterns (only in the replacement string of REGEXP_REPLACE).
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
Assuming the digits need to be contiguous, you can use a JavaScript UDF to find the number in a string with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
Hope that's helpful!
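For reference, here is a minimal Python sketch of the three stated requirements taken together, under the assumption that the two different digits do not have to be contiguous. The function name is_valid is hypothetical, and this only illustrates the logic; it is not Snowflake code.

```python
import re

# Allowed characters from the question: digits, letters, underscore, dash.
ALLOWED = re.compile(r"[0-9A-Za-z_-]+")

def is_valid(value):
    """At least 3 characters, only allowed characters, at least 2 distinct digits."""
    if value is None or len(value) < 3:
        return False
    if not ALLOWED.fullmatch(value):
        return False
    distinct_digits = {ch for ch in value if ch in "0123456789"}
    return len(distinct_digits) >= 2

for s in ["123", "111", "a1b2", "123 456", "ab"]:
    print(s, is_valid(s))
```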

String concatenation based of column length

I have telephone numbers like this in one table:
ID Telephone extention
------------------------------
1 9986323422 4
2 9992108 2222
3 9962718 241
The desired result: the digits in extention should replace the same number of trailing digits in the Telephone column.
I want my result to be:
ID Telephone extention result
-----------------------------------------
1 9986323422 4 9986323424
2 9992108 2222 9992222
3 9962718 241 9962241
I have 100k records like this. What is the best and quick way to achieve this? Thanks.
This may be a little too cute [1], but it is an alternative to the STUFF approaches:
SELECT ID,Telephone,Extension,
SUBSTRING(Telephone,1-LEN(Extension),LEN(Telephone)) + Extension as Result
It works because SUBSTRING accepts a start position below 1: the positions before the first character still count toward the requested length, so the result is truncated at the end by that amount.
[1] It avoids repeated calls to LEN() (though the optimizer should be able to avoid the duplication anyway) and avoids having to reverse the entire string, but this comes at a readability cost.
You can use STUFF() together with some calculations with LEN()
DECLARE @dummyTable TABLE(ID INT,Telephone VARCHAR(100), extention VARCHAR(100));
INSERT INTO @dummyTable VALUES
(1,'9986323422','4')
,(2,'9992108','2222')
,(3,'9962718','241');
SELECT *
,STUFF(t.Telephone,LEN(t.Telephone)-LEN(t.extention)+1,LEN(t.extention),t.extention) AS result
FROM @dummyTable AS t;
You might have to add some validations to avoid errors (e.g. length of extension should be smaller than of phone number)
Similarly, you can use the REVERSE() function with STUFF() to replace the trailing digits of the Telephone value with the extention value:
select *, reverse(stuff(reverse(Telephone), 1, len(extention), reverse(extention)))
from table
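All three approaches compute the same thing: keep the leading part of the telephone number and overwrite the tail with the extension. Below is a minimal Python sketch of that arithmetic, including the length validation suggested above. The function name overlay_extension is hypothetical.

```python
def overlay_extension(telephone, extension):
    """Replace the last len(extension) characters of telephone with extension."""
    if len(extension) > len(telephone):
        # Mirrors the validation note: the extension must not be longer than the number.
        raise ValueError("extension is longer than the telephone number")
    keep = len(telephone) - len(extension)  # number of leading characters to keep
    return telephone[:keep] + extension

for tel, ext in [("9986323422", "4"), ("9992108", "2222"), ("9962718", "241")]:
    print(tel, ext, overlay_extension(tel, ext))
```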

SQL: Adding number while REPLACING string

Good evening,
I need to replace a "part of a string" in a SQL column (Column3). I'm using the built-in REPLACE function to accomplish this, but I need to ADD a leading number (1) to the original string (Column2), and I keep getting "String or binary data would be truncated."
UPDATE [database].[dbo].[Table1]
SET [Column3] = REPLACE(Column3, Column2, ('1' + Column2))
One example:
Column2: "0200"
Here is an example of what COLUMN3 string looks like:
Column3: "TEST DATA 0200"
Then after it gets replaced we need to show it like this: "TEST DATA 10200"
Notice the number now includes a leading "1"
HELP PLEASE!!!
Maybe this REPLACE function replaces in a recursive loop until it blows the size of the string. Let's say Column3 is a varchar(10), and contains "abcd". Column2 contains "b". Then, it will replace all occurrences of Column2 in Column3 for '1' + Column2.
First replacement ('b' for '1b'):
"a1bcd"
Second replacement:
"a11bcd"
It keeps going...
"a111bcd"
"a1111bcd"
"a11111bcd"
"a111111bcd"
Next time it will try to put a 11 character string on a varchar(10).
You should define your own function (using T-SQL) to run the string only once and, after replace, continue from after the replacement and not from the right next character.
The error message you are getting indicates that the resulting string is too big to fit in the target column (Column3 in your example). You need to increase the size of that column to ensure the result will fit.
Try running this SELECT (instead of the UPDATE) to see what is happening
SELECT
REPLACE(Column3, Column2, ('1' + Column2)) AS Result,
DATALENGTH(REPLACE(Column3, Column2, ('1' + Column2))) AS ResultSize
FROM
[database].[dbo].[Table1]

Sql Server's regex LIKE - behaviour clarification?

Someone asked here how to get only the values that are numbers.
So, if the table is:
DECLARE @Table TABLE(
Col nVARCHAR(50)
)
INSERT INTO @Table SELECT 'ABC'
INSERT INTO @Table SELECT '234.62'
INSERT INTO @Table SELECT '10:10:10:10'
INSERT INTO @Table SELECT 'France'
INSERT INTO @Table SELECT '2'
then the desired results are:
234.62
2
But when I tested this query:
SELECT * FROM @Table WHERE Col LIKE '%[0-9.]%' --expected to see only 234.62
it showed:
234.62
10:10:10:10
2
Question #1
How come 10:10:10:10 and 2 satisfy the condition?
Question #2
I saw this answer here, which does work:
SELECT * FROM @Table WHERE Col NOT LIKE '%[^0-9.]%'
But I don't understand why this works. AFAIU, it selects all values which are not like (not(has number) and not(has dot)), which is ===>(De Morgan)===> not like (has number or has dot).
Can someone please shed light?
NB: I already know that ISNUMERIC can also be used, but it's unsafe (+). Also, the valid wildcards are %, _, [], [^].
Any particular use of [set] within a LIKE expression is a check against one character in the target string.
So, LIKE '%[0-9.]%' says: % matches 0-to-many arbitrary characters, then [0-9.] matches one character in the set 0-9., and then % matches 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character in the set 0-9.". So, 10:10:10:10 can be matched as 0 arbitrary characters, then 1 matches [0-9.], and then 0:10:10:10 matches the final %.
LIKE '%[^0-9.]%' says: % matches 0-to-many arbitrary characters, then [^0-9.] matches one character not in the set 0-9., and then % matches 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character outside of the set 0-9.". So when we apply the NOT to the front of that, we are saying "match any string that doesn't contain at least one character outside of the set 0-9." or "match strings that only contain characters in the set 0-9.".
Essentially, the double-negative is a way to make an assertion about all characters in the string.
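The difference between the two predicates can be restated as "any character" versus "all characters". Below is a minimal Python sketch of that reading, using the sample rows from the question. The function names are hypothetical, and this models only the logic, not SQL Server's LIKE implementation.

```python
ALLOWED = set("0123456789.")  # the set written as [0-9.] in the LIKE pattern

def like_has_allowed(value):
    """Col LIKE '%[0-9.]%': at least one character is in the set."""
    return any(ch in ALLOWED for ch in value)

def not_like_has_disallowed(value):
    """Col NOT LIKE '%[^0-9.]%': no character is outside the set."""
    return not any(ch not in ALLOWED for ch in value)

rows = ["ABC", "234.62", "10:10:10:10", "France", "2"]
print([r for r in rows if like_has_allowed(r)])
print([r for r in rows if not_like_has_disallowed(r)])
```

The second predicate is equivalent to all(ch in ALLOWED for ch in value), which is the "assertion about all characters" described above.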
