Remove duplicated words/strings within a string in Snowflake

I have this code that works in Oracle, to a certain degree:
with data(str) as (
select '3|BUTE PLACE|BUTE PLACE' from dual union all
select '3 BUTE PLACE BUTE PLACE' from dual union all
select '3 BUTE PLACE BUTE-PLACE' from dual)
select str, rtrim(str_new, ' ') new_str
from data
model
partition by (rownum rn)
dimension by (0 dim)
measures(str, str||' ' str_new)
rules
iterate(10000) until (str_new[0] = previous(str_new[0]))
(str_new[0]=regexp_replace(str_new[0],'(^| )([^ ]+ )(.*? )?\2+','\1\2\3'))
What I'm trying to figure out is how to use this code in Snowflake, where my address line is |-separated. I want to be able to turn '3|BUTE PLACE|BUTE PLACE' into '3|BUTE PLACE'
for the purpose of address matching.
Thanks
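The target behaviour can be sketched outside any SQL dialect. The Python below is only an illustration of the intended transformation, not a Snowflake solution: the function name `dedupe_tokens` is hypothetical, and it assumes that duplicates are whole |-separated tokens compared case-insensitively (so 'BUTE-PLACE' would not count as a duplicate of 'BUTE PLACE').

```python
def dedupe_tokens(address: str, sep: str = "|") -> str:
    """Keep the first occurrence of each separator-delimited token, in order."""
    seen = set()
    kept = []
    for token in address.split(sep):
        key = token.strip().upper()  # assumption: compare case-insensitively
        if key not in seen:
            seen.add(key)
            kept.append(token)
    return sep.join(kept)


print(dedupe_tokens("3|BUTE PLACE|BUTE PLACE"))
```

The same three steps (split on the separator, drop repeats while preserving order, re-join) are what a SQL version would need to reproduce.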

Related

LIKE '%[9-15]%' (SQL Server)

I'm using SQL Server 2017 and I would like to ask if it's possible to use the LIKE operator as follows:
LIKE '%Ticket [8-14]%'
Is this correct, or will numbers greater than 9 (10, 11, etc.) be read as 1 and 0, 1 and 1, 1 and 2, and so on?
If this doesn't work, what can I do to select all the rows that contain strings like 'Ticket 10', 'Ticket11' and so on?
Thank you for your time
As far as I know, no. However, you can do something like:
WITH Demo AS
(
SELECT * FROM (VALUES
('Ticket 1'),
('Ticket 7'),
('Ticket 8'),
('Ticket 10'),
('Ticket 12'),
('Ticket 15')
) T(X)
)
SELECT *
FROM Demo
WHERE X LIKE '%Ticket [8-9]%' OR X LIKE '%Ticket 1[0-4]%'
Also, consider normalization: create a TicketNumber column if you need to query on this value. It's much easier to concatenate 'Ticket' and a number than to parse a string, and TicketNumber could easily be indexed if needed.
There is also a more clever idea, which is to parse the numbers:
WITH Demo AS
(
SELECT * FROM (VALUES
('Ticket 1'),
('Ticket 7'),
('My Ticket 8A'),
('Ticket 10'),
('Some Ticket 12'),
('Ticket 15 other text'),
('Ticket 135 and more')
) T(X)
)
SELECT *, CAST(CASE WHEN PATINDEX('%Ticket [0-9][0-9][0-9]%',X)!=0 THEN SUBSTRING(X, PATINDEX('%Ticket [0-9][0-9][0-9]%',X)+7, 3)
WHEN PATINDEX('%Ticket [0-9][0-9]%',X)!=0 THEN SUBSTRING(X, PATINDEX('%Ticket [0-9][0-9]%',X)+7, 2)
WHEN PATINDEX('%Ticket [0-9]%',X)!=0 THEN SUBSTRING(X, PATINDEX('%Ticket [0-9]%',X)+7, 1)
END AS int) Number
FROM Demo
The Number column should now contain a simple int value, ready to compare and to take part in calculations.
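The idea of extracting the number first and comparing it as an integer can be sketched in Python. This is an illustration of the logic only, not T-SQL: the helper name `ticket_number` and the regular expression are assumptions, and the optional whitespace is there so that 'Ticket11' from the question is also recognised.

```python
import re
from typing import Optional

# Assumption: the number directly follows the word "Ticket", with optional spaces.
TICKET_RE = re.compile(r"Ticket\s*(\d+)")


def ticket_number(text: str) -> Optional[int]:
    """Return the integer after 'Ticket', or None if there is none."""
    match = TICKET_RE.search(text)
    return int(match.group(1)) if match else None


def in_range(text: str, low: int = 8, high: int = 14) -> bool:
    number = ticket_number(text)
    return number is not None and low <= number <= high


rows = [
    "Ticket 1", "Ticket 7", "My Ticket 8A", "Ticket 10",
    "Some Ticket 12", "Ticket 15 other text", "Ticket 135 and more",
]
selected = [row for row in rows if in_range(row)]
print(selected)
```

Comparing the parsed integer avoids the trap of a prefix pattern, where 'Ticket 135' would otherwise look like 'Ticket 13'.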
'%Ticket [8-14]%'
The brackets are used to specify a single character, that's usually specified as a lower and upper range like "[0-9]" or "[a-z]".
Your string would match:
"Ticket 1" through "Ticket 8". The 4 would be ignored because it's already handled by the 8-1 range.
It would not match "Ticket 0" or "Ticket 9" or "Ticket 10".
A slightly different approach to Pawel's, but a very similar idea:
SELECT *
FROM YourTable YT
CROSS APPLY(VALUES('8'),('9'),('10'),('11'),('12'),('13'),('14')) V(TN)
WHERE YT.YourColumn LIKE '%Ticket ' + V.TN + '%';
If you're using this in a stored procedure, you could use a table-valued parameter to hold the data instead. Something like:
CREATE TYPE numbers AS table (Number int);
GO
CREATE PROC YourProc @TicketNumbers numbers READONLY AS
SELECT *
FROM YourTable YT
CROSS JOIN @TicketNumbers TN
WHERE YT.YourColumn LIKE 'Ticket ' + CONVERT(varchar(3),TN.Number) + '%';
GO

Expression to find multiple spaces in string

We handle a lot of sensitive data and I would like to mask passenger names using only the first and last letter of each name part, joined by three asterisks (***).
For example: the name 'John Doe' will become 'J***n D***e'
For a name that consists of two parts this is doable by finding the space using the expression:
LEFT(CardHolderNameFromPurchase, 1) +
'***' +
CASE WHEN CHARINDEX(' ', PassengerName) = 0
THEN RIGHT(PassengerName, 1)
ELSE SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) -1, 1) +
' ' +
SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) +1, 1) +
'***' +
RIGHT(PassengerName, 1)
END
However, the passenger name can have more than two parts; there is no real limit to it. How can I find the indices of all spaces within an expression? Or should I tackle this problem in a different way?
Any help or pointer is much appreciated!
This solution does what you want it to, but is really the wrong approach to use when trying to hide personally identifiable data, as per Gordon's explanation in his answer.
SQL:
declare @t table(n nvarchar(20));
insert into @t values('John Doe')
,('JohnDoe')
,('John Doe Two')
,('John Doe Two Three')
,('John O''Neill');
select n
,stuff((select ' ' + left(s.item,1) + '***' + right(s.item,1)
from dbo.fn_StringSplit4k(t.n,' ',null) as s
for xml path('')
),1,1,''
) as mask
from @t as t;
Output:
+--------------------+-------------------------+
| n | mask |
+--------------------+-------------------------+
| John Doe | J***n D***e |
| JohnDoe | J***e |
| John Doe Two | J***n D***e T***o |
| John Doe Two Three | J***n D***e T***o T***e |
| John O'Neill | J***n O***l |
+--------------------+-------------------------+
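The per-part masking rule behind that output (first letter, three asterisks, last letter, for every space-separated part) can be sketched in Python. This is an illustration of the logic only; the function name `mask_name` is hypothetical.

```python
def mask_name(full_name: str) -> str:
    """Mask each space-separated part as first letter + '***' + last letter."""
    parts = [part for part in full_name.split(" ") if part]  # skip empty parts
    return " ".join(part[0] + "***" + part[-1] for part in parts)


for name in ["John Doe", "JohnDoe", "John Doe Two Three", "John O'Neill"]:
    print(name, "->", mask_name(name))
```

As with the SQL expression, a one-letter part such as 'W' is masked using the same letter on both sides ('W***W'), so very short name parts are barely hidden.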
String splitting function based on Jeff Moden's Tally Table approach:
create function [dbo].[fn_StringSplit4k]
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return, null returns all.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
GO
If you consider PassengerName as sensitive information, then you should not be storing it in clear text in generally accessible tables. Period.
There are several different options.
One is to have reference tables for sensitive information. Any table that references this would have an id rather than the name. Voilà: no sensitive information is available without access to the reference table, and that would be severely restricted.
A second method is a reversible encryption algorithm. This would allow the value to be gibberish, but with the right knowledge, it could be transformed back into a meaningful value. Typical methods for this are the public-key encryption algorithms devised by Rivest, Shamir, and Adleman (RSA encryption).
If you want to do first and last letters of names, I would be really careful about Asian names. Many of them consist of two or three letters when written in Latin script. That isn't much hiding. SQL Server does not have simple mechanisms to do this. You can write a user-defined function with a loop to manage the process. However, I view this as the least secure and least desirable approach.
This uses Jeff Moden's DelimitedSplit8K, as well as the new functionality in SQL Server 2017 STRING_AGG. As I don't know what version you're using, I've just gone "whole hog" and assumed you're using the latest version.
Jeff's function is invaluable here, as it returns the ordinal position, something which Microsoft have foolishly omitted from their own function, STRING_SPLIT (and didn't add in 2017 either). Ordinal position is key here, so we can't make use of the built in function.
WITH VTE AS(
SELECT *
FROM (VALUES ('John Doe'),('Jane Bloggs'),('Edgar Allan Poe'),('Mr George W. Bush'),('Homer J Simpson')) V(FullName)),
Masking AS (
SELECT *,
ISNULL(STUFF(Item, 2, LEN(item) -2,'***'), Item) AS MaskedPart
FROM VTE V
CROSS APPLY dbo.delimitedSplit8K(V.Fullname, ' '))
SELECT STRING_AGG(MaskedPart,' ') AS MaskedFullName
FROM Masking
GROUP BY Fullname;
Edit: Never mind, the OP has commented that they are using 2008, so STRING_AGG is out of the question. @iamdave, however, has posted an answer which is very similar to my own, just done the "old fashioned XML way".
Depending on your version of SQL Server, you may be able to use the built-in string split to rows on spaces in the name, do your string formatting, and then roll back up to name level using an XML path.
create table dataset (id int identity(1,1), name varchar(50));
insert into dataset (name) values
('John Smith'),
('Edgar Allen Poe'),
('One Two Three Four');
with split as (
select id, cs.Value as Name
from dataset
cross apply STRING_SPLIT (name, ' ') cs
),
formatted as (
select
id,
name,
left(name, 1) + '***' + right(name, 1) as out
from split
)
SELECT
id,
(SELECT ' ' + out
FROM formatted b
WHERE a.id = b.id
FOR XML PATH('')) [out_name]
FROM formatted a
GROUP BY id
Result:
id out_name
1 J***n S***h
2 E***r A***n P***e
3 O***e T***o T***e F***r
You can do that using this function.
create function [dbo].[fnMaskName] (@var_name varchar(100))
RETURNS varchar(100)
WITH EXECUTE AS CALLER
AS
BEGIN
declare @var_part varchar(100)
declare @var_return varchar(100)
declare @n_position smallint
set @var_return = ''
set @n_position = 1
WHILE @n_position<>0
BEGIN
SET @n_position = CHARINDEX(' ', @var_name)
IF @n_position = 0
SET @n_position = LEN(@var_name)
SET @var_part = SUBSTRING(@var_name, 1, @n_position)
SET @var_name = SUBSTRING(@var_name, @n_position+1, LEN(@var_name))
if @var_part<>''
SET @var_return = @var_return + stuff(@var_part, 2, len(@var_part)-2, replicate('*',len(@var_part)-2)) + ' '
END
RETURN(@var_return)
END

sort float numbers as a natural numbers in SQL Server

I asked the same question for jQuery here; now I have the same question for a SQL Server query :) This time the values are not comma-separated; each one is a separate row in the database.
I have separate rows holding float numbers:
Name
K1.1
K1.10
K1.2
K3.1
K3.14
K3.5
and I want to sort these float numbers like this:
Name
K1.1
K1.2
K1.10
K3.1
K3.5
K3.14
In my case, the digits after the decimal point should be treated as a natural number, so 1.2 counts as '2' and 1.10 counts as '10'; that's why 1.2 comes before 1.10.
You can remove the 'K', since it is common to nearly every value. A suggestion or example would be great, thanks.
You can use PARSENAME (which is more of a hack) or String functions like CHARINDEX , STUFF, LEFT etc to achieve this.
Input data
;WITH CTE AS
(
SELECT 'K1.1' Name
UNION ALL SELECT 'K1.10'
UNION ALL SELECT 'K1.2'
UNION ALL SELECT 'K3.1'
UNION ALL SELECT 'K3.14'
UNION ALL SELECT 'K3.5'
)
Using PARSENAME
SELECT Name,PARSENAME(REPLACE(Name,'K',''),2),PARSENAME(REPLACE(Name,'K',''),1)
FROM CTE
ORDER BY CONVERT(INT,PARSENAME(REPLACE(Name,'K',''),2)),
CONVERT(INT,PARSENAME(REPLACE(Name,'K',''),1))
Using String Functions
SELECT Name,LEFT(Name,CHARINDEX('.',Name) - 1), STUFF(Name,1,CHARINDEX('.',Name),'')
FROM CTE
ORDER BY CONVERT(INT,REPLACE((LEFT(Name,CHARINDEX('.',Name) - 1)),'K','')),
CONVERT(INT,STUFF(Name,1,CHARINDEX('.',Name),''))
Output
K1.1 K1 1
K1.2 K1 2
K1.10 K1 10
K3.1 K3 1
K3.5 K3 5
K3.14 K3 14
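The ordering rule used by both queries (compare the integer before the dot, then the integer after it) can be sketched in Python. This is an illustration of the sort key only; the function name `natural_key` is hypothetical, and it assumes every value has the form 'K<int>.<int>'.

```python
from typing import Tuple


def natural_key(name: str) -> Tuple[int, int]:
    """Split 'K<major>.<minor>' into two integers for sorting."""
    major, minor = name.lstrip("K").split(".")
    return int(major), int(minor)


names = ["K1.1", "K1.10", "K1.2", "K3.1", "K3.14", "K3.5"]
ordered = sorted(names, key=natural_key)
print(ordered)
```

Sorting on the pair of integers is what places K1.2 before K1.10, whereas a plain string sort would put K1.10 first.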
This works if there is always one char before the first number and the number is not higher than 9:
SELECT name
FROM YourTable
ORDER BY CAST(SUBSTRING(name,2,1) AS INT), --Get the number before dot
CAST(RIGHT(name,LEN(name)-CHARINDEX('.',name)) AS INT) --Get the number after the dot
Perhaps more verbose, but it should do the trick:
declare @source as table(num varchar(12));
insert into @source(num) values('K1.1'),('K1.10'),('K1.2'),('K3.1'),('K3.14'),('K3.5');
-- create helper table
with data as
(
select num,
cast(SUBSTRING(replace(num, 'K', ''), 1, CHARINDEX('.', num) - 2) as int) as [first],
cast(SUBSTRING(replace(num, 'K', ''), CHARINDEX('.', num), LEN(num)) as int) as [second]
from @source
)
-- Select and order accordingly
select num
from data
order by [first], [second]
sqlfiddle:
http://sqlfiddle.com/#!6/a9b06/2
A shorter solution is this one:
Select Num
from yourtable
order by cast((Parsename(Num, 1) ) as Int)

listagg data to useable format?

This is my first time working with the LISTAGG function and I'm confused. I can select the data easily enough, but the characters of the USERS column all have spaces in between them, and when I try to copy and paste it, no data from that column is copied. I've tried with two different IDEs. Am I doing something wrong?
Example:
select course_id, listagg(firstname, ', ') within group (order by course_id) as users
from (
select distinct u.firstname, u.lastname, u.student_id, cm.course_id
from course_users cu
join users u on u.pk1 = cu.users_pk1
join course_main cm on cm.pk1 = cu.crsmain_pk1
and cm.course_id like '2015SP%'
)
group by course_id;
Yields:
I had a similar problem; it turned out to be an encoding issue. I solved it like this (change to another encoding if needed):
...listagg(convert(firstname, 'UTF8', 'AL16UTF16'), ', ')...
Your firstname column seems to be defined as nvarchar2:
with t as (
select '2015SP.BOS.PPB.556.A'as course_id,
cast('Alissa' as nvarchar2(10)) as firstname
from dual
union all select '2015SP.BOS.PPB.556.A'as course_id,
cast('Dorothea' as nvarchar2(10)) as firstname
from dual
)
select course_id, listagg(firstname, ', ')
within group (order by course_id) as users
from t
group by course_id;
COURSE_ID USERS
-------------------- ------------------------------
2015SP.BOS.PPB.556.A
... and I can't copy/paste the users values from SQL Developer either, but it displays with spaces, as you can see from SQL*Plus:
COURSE_ID USERS
-------------------- ------------------------------
2015SP.BOS.PPB.556.A A l i s s a, D o r o t h e a
As the documentation says, the listagg() function always returns varchar2 (or raw), so passing in an nvarchar2 value causes an implicit conversion which is throwing out your results.
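One plausible reading of the spaced-out display can be sketched in Python. This is an illustration under assumptions, not a confirmed account of what Oracle does internally: it assumes the nvarchar2 data is stored as AL16UTF16 (big-endian UTF-16, two bytes per character) and that the bytes end up being shown as if they were a single-byte character set, with the zero bytes rendered as blanks.

```python
# Assumption: nvarchar2 holds 'Alissa' as big-endian UTF-16 (two bytes per character).
raw = "Alissa".encode("utf-16-be")
print(len(raw))  # twice the number of characters

# Read the same bytes as a single-byte encoding and show the zero bytes as spaces.
misread = raw.decode("latin-1").replace("\x00", " ")
print(misread.strip())
```

Under that assumption, every character is preceded by a zero byte, which matches the 'A l i s s a' pattern in the SQL*Plus output.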
If you're stuck with your column being of that data type, you could cast it to varchar2 inside the listagg call:
column users format a30
with t as (
select '2015SP.BOS.PPB.556.A'as course_id,
cast('Alissa' as nvarchar2(10)) as firstname
from dual
union all select '2015SP.BOS.PPB.556.A'as course_id,
cast('Dorothea' as nvarchar2(10)) as firstname
from dual
)
select course_id, listagg(cast(firstname as varchar2(10)), ', ')
within group (order by course_id) as users
from t
group by course_id;
COURSE_ID USERS
-------------------- ------------------------------
2015SP.BOS.PPB.556.A Alissa, Dorothea
But you probably don't really want it to be nvarchar2 at all.
Apparently it's a known (unresolved?) bug in 11. TO_CHAR() worked for me...
SELECT wiporderno, LISTAGG(TO_CHAR(medium), ',') WITHIN GROUP(ORDER BY wiporderno) AS jobclassification
...where medium was the problematic column/data type.

Can we compare LHS = RHS in query

I have a query where I input username as a single string:
'MONTY,JONTY'
My query part looks like:
SELECT *
FROM tbl_dummy
WHERE username IN (SELECT regexp_substr(:username, '[^,]+', 1, level)
FROM dual
CONNECT BY regexp_substr(:username, '[^,]+', 1, level) IS NOT NULL);
Here my column 'Username' will have data like:
Monty, Jonty
Monty
Jonty
Jonty, Monty
So when I pass my string, it is split into 'Monty' and 'Jonty', and
the query comparison misses two values, "Monty, Jonty" and "Jonty, Monty", out of the 4 rows.
If I were able to split my column value during the comparison, then I could prove LHS = RHS.
So it would be ('JONTY','MONTY') = ('MONTY','JONTY')
Is there any way to achieve this? I cannot write a stored procedure, so it has to be an Oracle query.
Also, has anyone used REGEXP_LIKE for this kind of thing? I am not able to find a syntax that would fit this code.
If I understand your question correctly, and if your comma-delimited values aren't too long (regexes in Oracle are limited to 512 bytes IIRC), you can replace the comma-delimited list with a pipe (|)-delimited list and use REGEXP_LIKE() as follows:
WITH u1 AS (
SELECT 'MONTY, JONTY' AS username FROM dual
UNION
SELECT 'MONTY' AS username FROM dual
UNION
SELECT 'JONTY' AS username FROM dual
UNION
SELECT 'JONTY, MONTY' AS username FROM dual
), u2 AS (
SELECT TRIM(REGEXP_SUBSTR('JONTY, MONTY','[^,]+', 1, LEVEL)) AS username FROM dual
CONNECT BY REGEXP_SUBSTR('JONTY, MONTY', '[^,]+', 1, LEVEL) IS NOT NULL
)
SELECT * FROM u1
WHERE EXISTS (
SELECT 1 FROM u2
WHERE REGEXP_LIKE(u2.username, '^(' || REGEXP_REPLACE(u1.username, '\s*,\s*', '|') || ')$', 'i')
)
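The comparison that query performs (a row qualifies when any name in the input list equals any name in the column's list, ignoring case and surrounding spaces) can be sketched in Python. This is an illustration of the logic only; the helper names `tokens` and `row_matches` are hypothetical.

```python
from typing import Set


def tokens(csv: str) -> Set[str]:
    """Split a comma-delimited string into a set of trimmed, upper-cased names."""
    return {part.strip().upper() for part in csv.split(",") if part.strip()}


def row_matches(column_value: str, user_input: str) -> bool:
    """True when the column and the input share at least one name."""
    return bool(tokens(column_value) & tokens(user_input))


rows = ["Monty, Jonty", "Monty", "Jonty", "Jonty, Monty"]
matched = [row for row in rows if row_matches(row, "MONTY,JONTY")]
print(matched)
```

If strict LHS = RHS equality is wanted instead of an overlap, comparing the two sets directly (`tokens(a) == tokens(b)`) ignores the order of the names.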
Simply use the LIKE operator.
where column_name like '%MONTY%' or column_name like '%JONTY%'
You are trying to use varying-IN list in the predicate. In your case, 'MONTY, JONTY' is a single string.
