How to ignore all whitespace characters and punctuation in Snowflake - snowflake-cloud-data-platform

The query below works for a single string. However, when I run it against the whole table, it does not work:
select
lower( regexp_replace( nvl(column1,':'), '\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+', '')) as addres_line_1,
column1
from values('122 E 7th Street ');
Output: 122e7thstreet
When I run a similar query against the table, the whitespace is not fully removed.
Output: 122e7th street
Table-level query:
select concat(
column1, ':',
column2, ':',
lower(regexp_replace(regexp_replace(nvl(ADDRESS_LINE_1,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+',''),'[ \t\r\n\v\f]+','')), ':',
lower(regexp_replace(nvl(ADDRESS_LINE_2,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(ADDRESS_LINE_3,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(PRIMARY_TOWN,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(COUNTRY,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(TERRITORY_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(POSTAL_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+|\\s+','')), ':',
lower(regexp_replace(nvl(COUNTRY_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+',''))
) as ROLE_PLAYER_ADDRESS_HASH_KEY1
from address

Does a regular expression like this one work:
REGEXP_REPLACE(REGEXP_REPLACE(ADDRESS_LINE_1, '[^\\w]'),'_')
\w is any digit, letter, or underscore - hence the need for an outer REGEXP_REPLACE to remove underscores

So you have three different regexes.
Your single-string demonstration regex is:
'\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+'
then you use:
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+'
for the table except for POSTAL_CODE which uses:
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+|\\s+'
So, do they all work equally well on your example input, or on the other inputs where you see failures?
select
'\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+' as reg_1,
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+' as reg_2,
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+|\\s+' as reg3,
lower( regexp_replace( nvl(column1,':'), reg_1, '')) as addres_1,
lower( regexp_replace( nvl(column1,':'), reg_2, '')) as addres_2,
lower( regexp_replace( nvl(column1,':'), reg3, '')) as addres_3,
column1
from values
('122 E 7th Street '),
('122 E 7th Street '),
(' 122 E 7th Street ')
-- etc, etc
;
Alrighty, so the character you have is the no-break space, U+00A0 (\u00a0), which is C2 A0 in UTF-8.
select
'\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?#\^_`{|}~-]+' as reg_1,
column1,
regexp_replace( column1, reg_1, '') as out
from values
('122 e 7th\u00a0street')
gives:
REG_1: \s+|[][!"#$%&'()*+,.\\/:;<=>?#^_`{|}~-]+
COLUMN1: 122 e 7th street
OUT: 122e7th street
And if you copy that output, the space has been converted to a normal 0x20 space.
So now that we know what the input data is, we just need to match it.
You can use the TRANSLATE function to swap the Unicode character for an ordinary space first:
select
column1
,regexp_replace( column1, '\\s+', '') as r1
,translate(column1,'\u00a0',' ') as t1
,regexp_replace( t1, '\\s+', '') as r
from values ('122 e 7th\u00a0street');
Which means we can just put \u00a0 into the regex:
regexp_replace( column1, '\\s+|\u00a0+', '')
Works like a charm!
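A quick way to confirm this kind of diagnosis outside the database is to dump the bytes of a suspect value. The sketch below is Python rather than Snowflake SQL, and it assumes that Python's `re.ASCII` flag is a fair stand-in for the `\s` behaviour seen above. The sample value is the address from the question with a no-break space inserted by hand.

```python
import re

# Sample address with a no-break space (U+00A0) between "7th" and "Street".
raw = "122 E 7th\u00a0Street "

# Hex dump of the UTF-8 bytes: an ordinary space is 20, a no-break space is c2 a0.
hex_dump = raw.encode("utf-8").hex(" ")
print(hex_dump)

# With re.ASCII, \s covers only [ \t\n\r\f\v], so the no-break space survives.
ascii_only = re.sub(r"\s+", "", raw, flags=re.ASCII).lower()

# Adding U+00A0 to the character class removes it as well.
with_nbsp = re.sub(r"[\s\u00a0]+", "", raw, flags=re.ASCII).lower()

print(repr(ascii_only))
print(repr(with_nbsp))
```

If the hex dump shows `c2 a0` where a plain `20` was expected, the value contains a no-break space and the pattern needs to name it explicitly.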

Related

How do I change ;A;B;C; to ('A', 'B', 'C')?

I need to convert a value list separated by semi-colons, including in front, into a regular value list with quotes and commas. There might only be one value, or there may be many values in the field.
I thought about replacing the ; with a comma, but then I still have a comma in front and behind, and I also need to add single quotes.
REPLACE(S_List, ';', ',')
I want ;a;b;c; to be 'a','b','c' or at least a,b,c but I don't know what to do with the beginning and end semicolons
With substring() and replace():
declare @slist varchar(100) = ';a;b;c;'
select substring(replace(@slist, ';', ''','''), 3, len(replace(@slist, ';', ''',''')) - 4)
Result:
'a','b','c'
Edit.
Use it like this in your table:
select
case when s_list like '%;%;%' then
substring(replace(s_list, ';', ''','''), 3, len(replace(s_list, ';', ''',''')) - 4)
else s_list
end
from tablename
You could try:
SELECT REPLACE(LEFT(S_List,1),';','''')
SELECT REPLACE(RIGHT(S_List,1),';','''')
SELECT REPLACE(S_List,';', ''',''')
If your values do not have spaces, you can use a trim() trick:
select '''' + replace(ltrim(rtrim(replace(s_list, ';', ' '))), ' ', ''',''') + ''''
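The conversion discussed above can also be checked outside SQL Server. This is a minimal Python sketch, not T-SQL; the function name `to_quoted_list` is made up for illustration, and it assumes the values themselves contain no semicolons or single quotes.

```python
def to_quoted_list(s_list: str) -> str:
    """Turn ';a;b;c;' into "'a','b','c'"."""
    # Drop the leading and trailing semicolons, then split on the inner ones.
    values = [v for v in s_list.strip(";").split(";") if v]
    # Wrap each value in single quotes and join with commas.
    return ",".join(f"'{v}'" for v in values)


print(to_quoted_list(";a;b;c;"))
print(to_quoted_list(";a;"))
```

Unlike the CASE expression above, this sketch also quotes a value that has no semicolons at all, which may or may not be what is wanted.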

How to remove spaces between comma or numbers in T-SQL?

SELECT REPLACE('10,6 7 7,900 11,027,900', ' ', '')
SELECT REPLACE('10,2 27,900 10,6 7 7,900 11,027,900', ' ', '')
Bad Result:
10,677,90011,027,900
10,227,90010,677,90011,027,900
Good Result:
10,677,900 11,027,900
10,227,900 10,677,900 11,027,900
This is an odd requirement. Before this goes downhill, I suggest you normalize your table properly. Anyway, if you're stuck with what you have for now, here is a way to solve your problem.
First, you need a string splitter, to split strings by comma. I use DelimitedSplit8K, written by Jeff Moden and improved by the members of SQL Server Central community.
After splitting the string, check if the value of each item after the space is removed has a length of 3. If yes, concatenate the new string (space removed). Else, concatenate the original item.
WITH Tbl(OriginalString) AS(
SELECT '10,6 7 7,900 11,027,900' UNION ALL
SELECT '10,2 27,900 10,6 7 7,900 11,027,900'
),
TblSplitted(originalString, ItemNumber, Item) AS (
SELECT *
FROM Tbl t
CROSS APPLY dbo.DelimitedSplit8K(t.OriginalString, ',')
)
SELECT *
FROM Tbl t
CROSS APPLY(
SELECT STUFF((
SELECT ',' +
CASE
WHEN LEN(REPLACE(s.Item, ' ', '')) = 3 THEN REPLACE(s.Item, ' ', '')
ELSE s.Item
END
FROM TblSplitted s
WHERE s.originalString = t.OriginalString
ORDER BY s.ItemNumber
FOR XML PATH('')
), 1, 1, '')
) x(NewString);
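The split-and-check rule described above can be sketched in a few lines of Python to verify it against the sample strings. This is an illustration of the rule only, not a replacement for the T-SQL; the function name `fix_number_groups` is hypothetical.

```python
def fix_number_groups(text: str) -> str:
    fixed = []
    for item in text.split(","):
        compact = item.replace(" ", "")
        # An item that is three characters long once its spaces are removed
        # is a single digit group broken up by stray spaces: keep it compact.
        # Anything else (e.g. "900 11") holds a real separator, so keep it.
        fixed.append(compact if len(compact) == 3 else item)
    return ",".join(fixed)


print(fix_number_groups("10,6 7 7,900 11,027,900"))
print(fix_number_groups("10,2 27,900 10,6 7 7,900 11,027,900"))
```

The rule relies on every inner group having exactly three digits, so it would not repair a leading group such as `1 0` that has been split.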

SSIS - How to apply replace function on all text columns in data flow

I have a data flow with over 150 columns, many of them of string data type. I need to remove commas and double quotes from the value of every text column because they cause issues when I export the data to CSV. Is there an easy way to do that other than handling every column explicitly in a derived column or script component?
In the script generator below, put all your column names from the CSV in the order that you want, and run it.
;With ColumnList as
(
Select 1 Id, 'FirstColumn' as ColumnName
Union Select 2, 'SecondColumn'
Union Select 3, 'ThirdColumn'
Union Select 4, 'FourthColumn'
Union Select 5, 'FifthColumn'
)
Select 'Trim (Replace (Replace (' + ColumnName + ', '','', ''''), ''"'', ''''))'
From ColumnList
Order BY Id
The Id column should contain a proper sequence (I would generate that in Excel). Here is the output:
---------------------------------------------------------
Trim (Replace (Replace (FirstColumn, ',', ''), '"', ''))
Trim (Replace (Replace (SecondColumn, ',', ''), '"', ''))
Trim (Replace (Replace (ThirdColumn, ',', ''), '"', ''))
Trim (Replace (Replace (FourthColumn, ',', ''), '"', ''))
Trim (Replace (Replace (FifthColumn, ',', ''), '"', ''))
(5 row(s) affected)
You could just cut and paste from here into your SSIS dataflow.
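The same expressions can be generated without a database connection. The following Python sketch assumes the column names are available as a plain list; `columns` and `cleanup_expression` are illustrative names, and the output format simply mirrors the generator output shown above.

```python
# Column names in the order they should appear (sample names from above).
columns = ["FirstColumn", "SecondColumn", "ThirdColumn"]


def cleanup_expression(column_name: str) -> str:
    # Build one expression that strips commas and double quotes, then trims.
    return f"Trim (Replace (Replace ({column_name}, ',', ''), '\"', ''))"


expressions = [cleanup_expression(name) for name in columns]
for expression in expressions:
    print(expression)
```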
I found a way to loop through the columns in a script component; I was then able to check each column's data type and apply a replace function. Here is the post I used.

SQL: Concatenate column values in a single row into a string separated by comma

Let's say I have a table like this in SQL Server:
Id  City       Province          Country
1   Vancouver  British Columbia  Canada
2   New York   null              null
3   null       Adama             null
4   null       null              France
5   Winnepeg   Manitoba          null
6   null       Quebec            Canada
7   Seattle    null              USA
How can I get a query result in which the location is a concatenation of the City, Province, and Country separated by ", ", with nulls omitted? I'd like to ensure that there are no trailing commas, leading commas, or empty strings. For example:
Id Location
1 Vancouver, British Columbia, Canada
2 New York
3 Adama
4 France
5 Winnepeg, Manitoba
6 Quebec, Canada
7 Seattle, USA
I think this takes care of all of the issues I spotted in other answers. No need to test the length of the output or check if the leading character is a comma, no worry about concatenating non-string types, no significant increase in complexity when other columns (e.g. Postal Code) are inevitably added...
DECLARE @x TABLE(Id INT, City VARCHAR(32), Province VARCHAR(32), Country VARCHAR(32));
INSERT @x(Id, City, Province, Country) VALUES
(1,'Vancouver','British Columbia','Canada'),
(2,'New York' , null , null ),
(3, null ,'Adama' , null ),
(4, null , null ,'France'),
(5,'Winnepeg' ,'Manitoba' , null ),
(6, null ,'Quebec' ,'Canada'),
(7,'Seattle' , null ,'USA' );
SELECT Id, Location = STUFF(
COALESCE(', ' + RTRIM(City), '')
+ COALESCE(', ' + RTRIM(Province), '')
+ COALESCE(', ' + RTRIM(Country), '')
, 1, 2, '')
FROM @x;
SQL Server 2012 added a new T-SQL function called CONCAT, but it is not useful here, since you still have to optionally include commas between discovered values, and there is no facility to do that - it just munges values together with no option for a separator. This avoids having to worry about non-string types, but doesn't allow you to handle nulls vs. non-nulls very elegantly.
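To cross-check the expected output for the sample rows, the "skip the nulls, join the rest" rule can be sketched in Python. This is not T-SQL: `None` stands in for NULL, the function name `location` is made up, and unlike the STUFF query the sketch also drops empty strings, as the question asks.

```python
from typing import Optional


def location(*parts: Optional[str]) -> str:
    # Trim trailing spaces (like RTRIM) and keep only non-null, non-empty parts.
    kept = [p.rstrip() for p in parts if p is not None and p.rstrip() != ""]
    # Joining puts the separator only between the parts that survived.
    return ", ".join(kept)


print(location("Vancouver", "British Columbia", "Canada"))
print(location(None, "Adama", None))
print(location("Seattle", None, "USA"))
```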
select Id ,
Coalesce( City + ',' +Province + ',' + Country,
City+ ',' + Province,
Province + ',' + Country,
City+ ',' + Country,
City,
Province,
Country
) as location
from table
This is a hard problem, because the commas have to go in-between:
select id, coalesce(city+', ', '')+coalesce(province+', ', '')+coalesce(country, '')
from t
seems like it should work, but we can get an extraneous comma at the end, such as when country is NULL. So, it needs to be a bit more complicated:
select id,
(case when right(val, 2) = ', ' then left(val, len(val) - 1)
else val
end) as val
from (select id, coalesce(city+', ', '')+coalesce(province+', ', '')+coalesce(country, '') as val
from t
) t
Without a lot of intermediate logic, I think the simplest way is to add a comma to each element, and then remove any extraneous comma at the end.
Use the '+' operator.
Understand that null values don't work with the '+' operator (so for example: 'Winnepeg' + null = null), so be sure to use the ISNULL() or COALESCE() functions to replace nulls with an empty string, e.g.: ISNULL('Winnepeg','') + ISNULL(null,'').
Also, if it is even remotely possible that one of your columns could be interpreted as a number, then be sure to use the CAST() function as well, in order to avoid error returns, e.g.: CAST('Winnepeg' as varchar(100)).
Most of the examples so far neglect one or more pieces of this. Also -- some of the examples use subqueries or do a length check, which you really ought not to do -- just not necessary -- though your optimizer might save you anyway if you do.
Good Luck
Ugly, but it will work for MS SQL:
select
id,
case
when right(rtrim(coalesce(city + ', ','') + coalesce(province + ', ','') + coalesce(country,'')),1)=',' then left(rtrim(coalesce(city + ', ','') + coalesce(province + ', ','') + coalesce(country,'')),LEN(rtrim(coalesce(city + ', ','') + coalesce(province + ', ','') + coalesce(country,'')))-1)
else rtrim(coalesce(city + ', ','') + coalesce(province + ', ','') + coalesce(country,''))
end
from
table
I know it's an old question, but should someone stumble upon this today, SQL Server 2017 and later has the STRING_AGG function, with the WITHIN GROUP option:
with level1 as
(select id,city as varcharColumn,1 as columnRanking from mytable
union
select id,province,2 from mytable
union
select id,country,3 from mytable)
select STRING_AGG(varcharColumn,', ')
within group(order by columnRanking)
from level1
group by id
Should empty strings exist alongside nulls, they should be excluded with a WHERE clause in level1.
Here is an option:
SELECT (CASE WHEN City IS NULL THEN '' ELSE City + ', ' END) +
(CASE WHEN Province IS NULL THEN '' ELSE Province + ', ' END) +
(CASE WHEN Country IS NULL THEN '' ELSE Country END) AS LOCATION
FROM MYTABLE

T-SQL string manipulation, replacement, comparing, pattern matching, regular expressions

I have a short string of alphanumeric characters A-Z and 0-9
Both Characters AND Numbers are included in the string.
I want to strip spaces, and compare each string against a 'pattern' of which it will match only one. The Patterns use A to denote any character A-Z and 9 for any 0-9.
The 6 patterns are:
A99AA
A999AA
A9A9AA
AA99AA
AA999AA
AA9A9AA
I have these in a table with another column, with the correct space in place :-
pattern PatternTrimmed
A9 9AA A99AA
A99 9AA A999AA
A9A 9AA A9A9AA
AA9 9AA AA99AA
AA99 9AA AA999AA
AA9A 9AA AA9A9AA
I am using SQL Server 2005, and I don't want to have 34 replace statements changing each of the characters and numbers to A's and 9's.
Suggestions on how I can achieve this in a short succinct way, please.
Here's what I want to avoid :-
update postcodes set Pattern = replace (Pattern, 'B', 'A')
update postcodes set Pattern = replace (Pattern, 'C', 'A')
update postcodes set Pattern = replace (Pattern, 'D', 'A')
update postcodes set Pattern = replace (Pattern, 'E', 'A')
etc.
and
update postcodes set Pattern = replace (Pattern, '0', '9')
update postcodes set Pattern = replace (Pattern, '1', '9')
update postcodes set Pattern = replace (Pattern, '2', '9')
etc
Basically, I am trying to take a UK postcode typed in at a call centre by an imbecile, and pattern match the entered postcode against one of the 6 above patterns, and work out where to insert the space.
What about something like this:
Declare @table table
(
ColumnToCompare varchar(20),
AmendedValue varchar(20)
)
Declare @patterns table
(
Pattern varchar(20),
TrimmedPattern varchar(20)
)
Insert Into @table (ColumnToCompare)
Select 'BBB87 BBB'
Union all
Select 'J97B B'
union all
select '282 8289'
union all
select 'UW83 7YY'
union all
select 'UW83 7Y0'
Insert Into @patterns
Select 'A9 9AA', 'A99AA'
union all
Select 'A99 9AA', 'A999AA'
union all
Select 'A9A 9AA', 'A9A9AA'
union all
Select 'AA9 9AA', 'AA99AA'
union all
Select 'AA99 9AA', 'AA999AA'
union all
Select 'AA9A 9AA', 'AA9A9AA'
Update @table
Set AmendedValue = Left(Replace(ColumnToCompare, ' ',''), (CharIndex(' ', Pattern)-1)) + space(1) +
SubString(Replace(ColumnToCompare, ' ',''), (CharIndex(' ', Pattern)), (Len(ColumnToCompare) - (CharIndex(' ', Pattern)-1)))
From @table
Cross Join @patterns
Where PatIndex(Replace((Replace(TrimmedPattern, 'A','[A-Z]')), '9','[0-9]'), Replace(ColumnToCompare, ' ' ,'')) > 0
select * From @table
This part
Left(Replace(ColumnToCompare, ' ',''), (CharIndex(' ', Pattern)-1))
finds the position of the space in the pattern that has been matched and takes the left-hand portion of the string being compared, up to that position.
It then adds a space
+ space(1) +
Then this part
SubString(Replace(ColumnToCompare, ' ',''), (CharIndex(' ', Pattern)), (Len(ColumnToCompare) - (CharIndex(' ', Pattern)-1)))
appends the remainder of the string to the new value.
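The match-then-insert logic walked through above can be sketched in Python to try it on individual postcodes. This is an illustration under assumptions, not the answer's T-SQL: the names `PATTERNS`, `pattern_to_regex` and `insert_space` are made up, and upper-casing the input is an added assumption (whether the SQL version is case-sensitive depends on the collation).

```python
import re
from typing import Optional

# The six spaced patterns from the question: A = any letter, 9 = any digit.
PATTERNS = ["A9 9AA", "A99 9AA", "A9A 9AA", "AA9 9AA", "AA99 9AA", "AA9A 9AA"]


def pattern_to_regex(trimmed: str) -> str:
    # Translate each A into [A-Z] and each 9 into [0-9].
    return "".join("[A-Z]" if ch == "A" else "[0-9]" for ch in trimmed)


def insert_space(raw: str) -> Optional[str]:
    # Strip the spaces typed by the user (upper-casing is an assumption).
    compact = raw.replace(" ", "").upper()
    for pattern in PATTERNS:
        trimmed = pattern.replace(" ", "")
        if re.fullmatch(pattern_to_regex(trimmed), compact):
            # Put the space where the matching pattern has it.
            cut = pattern.index(" ")
            return compact[:cut] + " " + compact[cut:]
    return None  # no pattern matched, so the input is not a valid postcode shape


print(insert_space("UW837YY"))
print(insert_space("J97B B"))
print(insert_space("282 8289"))
```

Because the six trimmed patterns are mutually exclusive, at most one of them can match a given input, so the first match is the only match.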

Resources