Regex in SQL Server Replace function - sql-server

I have a variable with random text, let's say
DECLARE @sNumberFormat NVARCHAR(200) = 'rand{text.here,{999}also-Random9He8re'
I want to replace each 9 in {999} with [0-9]. So in this example I would like to get
'rand{text.here,[0-9][0-9][0-9]also-Random9He8re'
The problem is that I never know how many 9s will be placed between the braces, so there can be {99}, {9999}, and so on. I also need to validate the content: if there is any invalid character (anything other than 9) between the braces, nothing should be replaced.
I have tried some combinations of the REPLACE and PATINDEX functions, but I could not achieve that.

Sans robust regex support, SQL Server's native functions do not give much help here. One approach, a bit hackish, would be to separate the input string into three components:
rand{text.here,
{999}
also-Random9He8re
Next, replace the 9 in the middle target substring with #, or some other character which you don't expect to appear anywhere else in your input string:
rand{text.here,
{###}
also-Random9He8re
Finally, replace the # in the middle substring with [0-9] and then concatenate together to get the final result:
DECLARE @val NVARCHAR(200) = 'rand{text.here,{999}also-Random9He8re'
SELECT REPLACE(
SUBSTRING(@val, 1, CHARINDEX('{9', @val) - 1) +
REPLACE(SUBSTRING(@val,
CHARINDEX('{9', @val) + 1,
CHARINDEX('9}', @val) - CHARINDEX('{9', @val)), '9', '#') +
SUBSTRING(@val, CHARINDEX('9}', @val) + 2, LEN(@val) - CHARINDEX('9}', @val)),
'#', '[0-9]');
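The question also asks that nothing be replaced when the braces contain anything other than 9. A minimal sketch of one way to add that check, assuming only the first {9...} group matters and using STUFF and REPLICATE instead of a placeholder character (the variable names @open, @close and @inner are illustrative, not from the answer above):

```sql
DECLARE @val NVARCHAR(200) = N'rand{text.here,{999}also-Random9He8re';

-- Position of the first '{' that is immediately followed by a 9
DECLARE @open INT = CHARINDEX(N'{9', @val);

-- Position of the first '}' after that opening brace (0 if there is none)
DECLARE @close INT = CASE WHEN @open > 0
                          THEN CHARINDEX(N'}', @val, @open)
                          ELSE 0 END;

-- Text between the braces, or an empty string when no group was found
DECLARE @inner NVARCHAR(200) = CASE WHEN @close > @open
                                    THEN SUBSTRING(@val, @open + 1, @close - @open - 1)
                                    ELSE N'' END;

SELECT CASE
         -- Replace only when the group exists and contains nothing but 9s
         WHEN @inner <> N'' AND @inner NOT LIKE N'%[^9]%'
         THEN STUFF(@val, @open, @close - @open + 1,
                    REPLICATE(N'[0-9]', LEN(@inner)))
         ELSE @val
       END AS result;
```

For the sample value this yields the string requested in the question; a group such as {9a9} is left untouched because the NOT LIKE test fails.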

So the lazy dev in me suggests this:
SELECT Replace(
Replace(
Replace(
Replace(@input, '{9999}', '[0-9][0-9][0-9][0-9]')
, '{999}', '[0-9][0-9][0-9]')
, '{99}', '[0-9][0-9]')
, '{9}', '[0-9]') AS result
;
You can keep extending as long as you like to perform your (one off?) replacements.
Quick. Simple. Extensible. Hacky.
Sometimes lazy is good enough.
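If the nesting gets unwieldy, the same idea can be written as a loop that builds each pattern with REPLICATE. This is only a sketch under the assumption that there is a known upper bound on the number of 9s (10 here, chosen arbitrarily):

```sql
DECLARE @input NVARCHAR(200) = N'rand{text.here,{999}also-Random9He8re';
DECLARE @n INT = 10;  -- assumed maximum number of 9s inside the braces

WHILE @n >= 1
BEGIN
    -- Replace '{' + n nines + '}' with n copies of '[0-9]'
    SET @input = REPLACE(@input,
                         N'{' + REPLICATE(N'9', @n) + N'}',
                         REPLICATE(N'[0-9]', @n));
    SET @n = @n - 1;
END;

SELECT @input AS result;
```

Because each pattern includes both braces, a group containing any other character never matches and is left as it is.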

This can be done with a series of CTEs. It works with an arbitrary number of "9" values in curly braces.
Declare @str varchar(max) = 'rand{text.here,{999}also-Random9He8re';
With A As
(Select 1 As Pos
Union All
Select Pos+1 As Pos From A Where Pos < LEN(@str)
),
B As (
Select STRING_AGG(Case When Chr Like '[{9}]' Then Chr Else ' ' End, '') As Chr
From A Cross Apply (Select SUBSTRING(@str,A.Pos,1 )) As T(chr)
),
C As (
Select [value] As pattern,
REPLACE(REPLACE(REPLACE([value], '9', '[0-9]'),'{',''),'}','') As replacement,
ROW_NUMBER() Over (ORDER BY (SELECT NULL)) As Num,
COUNT(*) OVER (ORDER BY (SELECT NULL)) As Cnt
From B Cross Apply STRING_SPLIT(Chr,' ')
Where [value] Like '{%}' And [value] Like '%9%'
),
D As (
Select @str As Result, 1 As Num
Union All
select REPLACE(Result, C.pattern, C.replacement) As Res , D.Num+1 As Num
From D Inner Join C On (D.Num=C.Num)
Where D.Num<=C.Cnt)
Select Top 1 Result
From D
Order by Num Desc
A - Gets the list of character positions in the text
B - Gets the text with every character other than '9', '{', '}' replaced by a space
C - Gets the patterns and their corresponding replacement values
D - Gets the result by applying the REPLACE function for each pattern

Related

TSQL: How insert separator between each character in a string

I have a string like this:
Apple
I want to include a separator after each character so the end result will turn out like this:
A,p,p,l,e
In C#, we have a one-line way to achieve the above with Regex.Replace("Apple", ".{1}", "$0,");
I can only think of looping over each character with CHARINDEX to append the separator, but that seems a little complicated. Is there a simpler, more elegant way to achieve this?
Thanks HABO for the suggestions. I'm able to generate the result I want using that code, but it takes a little time to really understand how the code works.
After some searching, I managed to find a useful article on inserting spaces between each character, and it's easier for me to understand.
I modified the code a little to define and include the desired separator instead of fixing it to a space:
DECLARE @pos INT = 2 -- location where we want the first separator
DECLARE @result VARCHAR(100) = 'Apple'
DECLARE @separator nvarchar(5) = ','
WHILE @pos < LEN(@result)+1
BEGIN
SET @result = STUFF(@result, @pos, 0, @separator);
SET @pos = @pos+2;
END
select @result; -- Output: A,p,p,l,e
Reference
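Note that the loop above advances by 2, which only fits a one-character separator. A possible generalization, sketched here with a hypothetical @step variable and a two-character separator, advances by the separator length plus one:

```sql
DECLARE @result VARCHAR(100) = 'Apple';
DECLARE @separator VARCHAR(5) = '--';

-- One original character plus the separator; DATALENGTH (bytes) is used
-- instead of LEN so that a separator ending in a space is counted in full.
-- This equals the character count only because @separator is VARCHAR here.
DECLARE @step INT = DATALENGTH(@separator) + 1;
DECLARE @pos INT = 2;  -- insert before the second character first

WHILE @pos <= LEN(@result)
BEGIN
    SET @result = STUFF(@result, @pos, 0, @separator);
    SET @pos = @pos + @step;
END;

SELECT @result;  -- A--p--p--l--e
```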
In the following SQL scripts, I get each character with the SUBSTRING() function and a numbers table (I used the spt_values view here for simplicity), and then concatenate the characters using one of two methods; you can choose either.
If you are using SQL Server 2017 or later, there is a new string aggregation function, STRING_AGG, which the first script uses:
declare @str nvarchar(max) = 'Apple'
SELECT
string_agg( substring(@str,number,1) , ',') Within Group (Order By number)
FROM master..spt_values n
WHERE
Type = 'P' and
Number between 1 and len(@str)
If you are working with a previous version, you can use string concatenation using FOR XML Path and SQL Stuff function as follows
declare @str nvarchar(max) = 'Apple'
; with cte as (
SELECT
number,
substring(@str,number,1) as L
FROM master..spt_values n
WHERE
Type = 'P' and
Number between 1 and len(@str)
)
SELECT
STUFF(
(
SELECT
',' + L
FROM cte
order by number
FOR XML PATH('')
), 1, 1, ''
)
Both solutions yield the same result. I hope it helps.
If you have SQL Server 2017 and a copy of ngrams8k it's ultra simple:
declare @word varchar(100) = 'apple';
select newString = string_agg(token, ',') within group (order by position)
from dbo.ngrams8k(@word,1);
For pre-2017 systems it's almost as simple:
declare @word varchar(100) = 'apple';
select newstring =
( select token + case len(@word)+1-position when 1 then '' else ',' end
from dbo.ngrams8k(@word,1)
order by position
for xml path(''))
One ugly way to do it is to split the string into characters, ideally using a numbers table, and reassemble it with the desired separator.
A less efficient implementation uses recursion in a CTE to split the characters and insert the separator between pairs of characters as it goes:
declare @Sample as VarChar(20) = 'Apple';
declare @Separator as Char = ',';
with Characters as (
select 1 as Position, Substring( @Sample, 1, 1 ) as Character
union all
select Position + 1,
case when Position & 1 = 1 then @Separator else Substring( @Sample, Position / 2 + 1, 1 ) end
from Characters
where Position < 2 * Len( @Sample ) - 1 )
select Stuff( ( select Character + '' from Characters order by Position for XML Path( '' ) ), 1, 0, '' ) as Result;
You can replace the select Stuff... line with select * from Characters; to see what's going on.
Try this
declare @var varchar(50) ='Apple'
;WITH CTE
AS
(
SELECT
SeqNo = 1,
MyStr = @var,
OpStr = CAST('' AS VARCHAR(50))
UNION ALL
SELECT
SeqNo = SeqNo+1,
MyStr = MyStR,
OpStr = CAST(ISNULL(OpStr,'')+SUBSTRING(MyStR,SeqNo,1)+',' AS VARCHAR(50))
FROM CTE
WHERE SeqNo <= LEN(@var)
)
SELECT
OpStr = LEFT(OpStr,LEN(OpStr)-1)
FROM CTE
WHERE SeqNo = LEN(@Var)+1

need patindex to match a-zA-Z0-9./-#' and space

I am cleaning up some data and would like to create a PATINDEX expression that rejects any string containing characters other than A-Za-z0-9./'-# and a space.
This rejects the special chars which should be allowed:
patindex ( '%[^A-Z0-9a-z./'-# ]%',stringtobetested )
Should I be masking the special chars? The bad and/or good chars can appear multiple times in a given string.
So where stringtobetested is abc#D-EF should pass but abc*def should fail.
This should work... it just uses replace to get around your escaping problems.
declare @stringtobetested1 varchar(64) = 'abc#D-EF'
declare @stringtobetested2 varchar(64) = 'abc*def '
select
@stringtobetested1 string1
,replace(replace(replace(replace(@stringtobetested1,'''','#'),' ','#'),'/','#'),'.','#') string1changed
,@stringtobetested2 string2
,replace(replace(replace(replace(@stringtobetested2,'''','#'),' ','#'),'/','#'),'.','#') string2changed
,patindex('%[^A-Z0-9a-z#-]%',replace(replace(replace(replace(@stringtobetested1,'''','#'),' ','#'),'/','#'),'.','#'))
,patindex('%[^A-Z0-9a-z#-]%',replace(replace(replace(replace(@stringtobetested2,'''','#'),' ','#'),'/','#'),'.','#'))
This can all be done with a PATINDEX, you just need some syntax help:
WITH test AS (
SELECT val
FROM (VALUES ('abc#D-EF./ -'''), ('abc*def')) AS t (val)
)
SELECT
input = t.val,
result = IIF(PATINDEX('%[^A-Z0-9a-z./''# -]%', t.val) > 0, 'fails', 'passes')
FROM test t;
First of all, the pattern itself is a string, and strings in T-SQL escape ' by doubling it to ''. Secondly, inside a [^ ] wildcard in a pattern, the - is used to define a character range when it occurs between two characters. By moving it to an end of the wildcard pattern, it is treated literally.
Other escape sequences specific to pattern wildcards can be found in this docs page: Pattern Matching in Search Conditions
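To see the effect of the dash position in isolation, here is a small sketch with made-up sample values and column names. The first pattern has the dash between two characters, so it is read as a range; the second has it at the end of the set, where, as described above, it is taken literally:

```sql
SELECT val,
       -- '-' between a and c: matches any character outside the range a..c
       dash_as_range   = PATINDEX('%[^a-c]%', val),
       -- '-' at the end: matches any character other than a, c or a literal dash
       dash_as_literal = PATINDEX('%[^ac-]%', val)
FROM (VALUES ('abc'), ('a-c')) AS t (val);
```

A result of 0 means no disallowed character was found; any other value is the position of the first character the set rejects.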
Please run more testing for my query, and adjust max string length to fit your requirement.
DECLARE @TestString varchar(64) = 'abc#D-E/*F'
, @MaxStringLen INT = 20
;WITH cte AS(SELECT 1 number
UNION ALL
SELECT number + 1
FROM cte
WHERE number < @MaxStringLen
)
SELECT @TestString AS OriginalString
, CAST(CAST((SELECT SUBSTRING(@TestString, Number, 1)
FROM cte
WHERE Number <= LEN(@TestString) AND
SUBSTRING(@TestString, Number, 1) LIKE '%[A-Z0-9a-z-# ./'']%' FOR XML Path(''))
AS xml) AS varchar(MAX)) AS ConvertedString
, CASE WHEN @TestString = CAST(CAST((SELECT SUBSTRING(@TestString, Number, 1)
FROM cte
WHERE Number <= LEN(@TestString) AND
SUBSTRING(@TestString, Number, 1) LIKE '%[A-Z0-9a-z-# ./'']%' FOR XML Path(''))
AS xml) AS varchar(MAX))
THEN 1 ELSE 0 END IsAllowed

SQL Server : extracting number from a string

I have the following SQL command to extract only numbers from a string:
UPDATE Oesskattings
SET alfasorteer1 = CASE
WHEN CHARINDEX('-', blokno) > 0
THEN SUBSTRING(blokno + '-', 0, CHARINDEX('-', blokno))
ELSE SUBSTRING(blokno, PATINDEX('%[0-9]%', blokno), LEN(blokno))
END
My problem is when I have a record where blokno is e.g. 1B (the conversion fails when the number is followed by a character).
How can I improve my code?
Regards
This will pull out all numbers from a string regardless of other characters and sequence.
It uses a Recursive CTE to identify numbers in order, then puts them back together with STUFF XML PATH
DROP TABLE #TMP
CREATE TABLE #TMP(ID INT IDENTITY(1,1),txt VARCHAR(20))
INSERT INTO #TMP VALUES
('q12w--e32w')
,('vfr45tgbnhy67')
,('12wq3&&r5f5')
,('1qw%%23er45t')
,('de32()ws2')
,('desfghj')
;WITH A
AS (
SELECT ID,1 POS
,txt
,SUBSTRING(txt,PATINDEX('%[0-9]%',txt),1) CHR -- [0-9] so that zeros are kept too
,RIGHT(txt,LEN(txt)-PATINDEX('%[0-9]%',txt)) REM
FROM #TMP
WHERE PATINDEX('%[0-9]%',txt) > 0
UNION ALL
SELECT ID,POS + 1
,txt
,SUBSTRING(REM,PATINDEX('%[0-9]%',REM),1) CHR
,RIGHT(REM,LEN(REM)-PATINDEX('%[0-9]%',REM)) REM
FROM
A
WHERE
PATINDEX('%[0-9]%',REM) > 0
)
,c AS
(
SELECT
ID,txt
,STUFF(
(SELECT ''+b.chr
FROM a b
WHERE a.ID = b.id
ORDER BY POS
FOR XML PATH('')),1,0,'') AS chrs
FROM
A
)
SELECT DISTINCT
*
FROM
C
What worked for me was to extract the first run of consecutive digits in the string with:
set alfasorteer1 = substring(blokno, patindex('%[0-9]%', blokno), 1+patindex('%[0-9][^0-9]%', blokno+'x')-patindex('%[0-9]%', blokno))
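As a quick check of that expression, the following sketch applies it to a few made-up blokno values (the sample values and the first_number alias are illustrative only). The start position is the first digit, and the length runs up to the first digit that is followed by a non-digit; appending 'x' guarantees such a position exists when the string ends in a digit:

```sql
SELECT blokno,
       first_number = SUBSTRING(blokno,
                                PATINDEX('%[0-9]%', blokno),
                                1 + PATINDEX('%[0-9][^0-9]%', blokno + 'x')
                                  - PATINDEX('%[0-9]%', blokno))
FROM (VALUES ('1B'), ('A12B'), ('12-3'), ('ABC')) AS t (blokno);
-- '1B' -> '1', 'A12B' -> '12', '12-3' -> '12', 'ABC' -> '' (no digits)
```

The result is still a string, so it would need an explicit conversion (and handling of the empty string) before being stored in a numeric column.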

select only integer from right before certain character

I want a simple solution to select only numbers from right of a VARCHAR column
before a certain character.
For example, in the strings below, I only want to select numbers before slash character. The numbers vary, it can be 1 digit or more.
'ST/11/SCI/1' 'ST/11/SCI/22' 'ST/11/BIO/854' 'ST/11/BIO/5421'
You can use REVERSE with CHARINDEX so as to locate the position of the last '/' character. Then use RIGHT to extract the number:
DECLARE @str VARCHAR(100) = 'ST/11/BIO/854'
SELECT RIGHT(@str, CHARINDEX('/', REVERSE(@str)) - 1)
Edit:
To get second number starting from the end you can use:
DECLARE @str VARCHAR(100) = 'ST/11/BIO/1288/544'
SELECT SUBSTRING(@str, s.l - q2.x + 2, q2.x - q1.x - 1)
FROM (SELECT @str AS v) AS t
CROSS APPLY (SELECT CHARINDEX('/', REVERSE(@str))) AS q1(x)
CROSS APPLY (SELECT CHARINDEX('/', REVERSE(@str), q1.x + 1)) AS q2(x)
CROSS APPLY (SELECT LEN(@str)) AS s(l)
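The RIGHT/REVERSE expression assumes the string contains at least one '/'. Without one, CHARINDEX returns 0 and RIGHT is called with a length of -1, which raises an error. One possible guard, sketched here with made-up sample rows, is to prepend a '/' before searching so that a match always exists:

```sql
SELECT s,
       -- The extra leading '/' only matters when s has no slash of its own;
       -- in that case the whole string is returned.
       last_part = RIGHT(s, CHARINDEX('/', REVERSE('/' + s)) - 1)
FROM (VALUES ('ST/11/SCI/1'), ('ST/11/BIO/5421'), ('NOSLASH')) AS t (s);
-- 'ST/11/SCI/1' -> '1', 'ST/11/BIO/5421' -> '5421', 'NOSLASH' -> 'NOSLASH'
```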
Use SUBSTRING, CHARINDEX, REVERSE
REVERSE : It will REVERSE your original string value
CHARINDEX : It will find index/location of any character from string value
SUBSTRING : It will split your string value
DECLARE @myString as VARCHAR(50)='ST/11/SCI/22'
SELECT
REVERSE
(
SUBSTRING
(
REVERSE(@myString),
0,
CHARINDEX('/',REVERSE(@myString))
)
)
I am posting this as another answer, because the approach is completely different from the one in my other answer:
Starting with SQL Server 2012 (thx @NEER!) there is PARSENAME, which is a very straightforward way to split a dot-delimited string into up to 4 parts:
DECLARE @stringTable TABLE(string VARCHAR(100));
INSERT INTO @stringTable VALUES
('ST/11/SCI/1'),('ST/11/SCI/22'),('ST/11/BIO/854'),('ST/11/BIO/5421');
SELECT PARSENAME(REPLACE(s.string,'/','.'),1)
FROM @stringTable AS s
The result
1
22
854
5421
If you need all data you can split this easily and type-safe:
DECLARE @stringTable TABLE(string VARCHAR(100));
INSERT INTO @stringTable VALUES
('ST/11/SCI/1'),('ST/11/SCI/22'),('ST/11/BIO/854'),('ST/11/BIO/5421');
SELECT s.string
,x.value('/x[1]','nvarchar(10)') AS Part1
,x.value('/x[2]','int') AS Part2
,x.value('/x[3]','nvarchar(10)') AS Part3
,x.value('/x[4]','int') AS Part4
FROM @stringTable AS s
CROSS APPLY(SELECT CAST('<x>' + REPLACE(s.string,'/','</x><x>')+'</x>' AS XML)) AS A(x)
The result
string Part1 Part2 Part3 Part4
ST/11/SCI/1 ST 11 SCI 1
ST/11/SCI/22 ST 11 SCI 22
ST/11/BIO/854 ST 11 BIO 854
ST/11/BIO/5421 ST 11 BIO 5421
You can use a combination of reverse and charindex :
select reverse(left(reverse(...), charindex('/', reverse(...)) -1))
Documentation of reverse : https://msdn.microsoft.com/fr-fr/library/ms180040.aspx
Documentation of charindex : https://msdn.microsoft.com/fr-fr/library/ms186323.aspx
Test: select reverse(left(reverse('ST/11/SCI/2'), charindex('/', reverse('ST/11/SCI/2')) -1))
Output: 2
Note: We don't know your version of SQL Server, so check the documentation of reverse/charindex to see which versions are supported!

Keep only desired characters and separate with semicolon in T-SQL

The problem:
I have text data imported into the db with a lot of unwanted characters. I need to keep only 4 capital letter strings within the imported text string. Example:
1447;#MIBD (This is a nice name);#2056;#LKRE (Very nice name indeed)
this could be in one column in one row of my table. What I need to extract from the string is:
MIBD and LKRE
And the result should preferably be the desired strings separated with semicolons.
It should be applied to the whole column and I cannot know how many of these 4 upper case letter strings might appear in one row.
I went through all sorts of functions like PATINDEX etc. but I really do not know how to approach it. Thanks for any help!
Try this; it assumes that the four-character code is always preceded by ;#. As PATINDEX is case insensitive, I have added an additional check to verify that all four characters are capital letters.
DECLARE @MyTable Table( ID INT, MyString VARCHAR(8000))
INSERT INTO @MyTable
VALUES
(1, '1447;#MIBD (This is a nice name);#2056;#LKRE (Very nice name indeed)')
,(2, ';#DBCC (This is a nice name);#2056;#LLC (Very nice name indeed) ;#ABCD')
,(3, ';#AaaA;#OPQR;1234 (and) ;#WXYZ')
,(4, ';#abc this empty string without any code')
;WITH CTE AS
(
SELECT ID
,SUBSTRING(MyString, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+2, 4) AS NewString
-- remove everything up to and including the code (+5), so a code that follows immediately is not skipped
,STUFF(MyString, 1, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+5, '') AS MyString
FROM @MyTable m
WHERE PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString) > 0
UNION ALL
SELECT ID
,SUBSTRING(MyString, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+2, 4) AS NewString
,STUFF(MyString, 1, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+5, '') AS MyString
FROM CTE c
WHERE PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString) > 0
)
SELECT c.ID,
STUFF(( SELECT '; ' + NewString
FROM CTE c1
WHERE c1.ID = c.ID
AND ASCII(SUBSTRING(NewString, 1, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- first char
AND ASCII(SUBSTRING(NewString, 2, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- second char
AND ASCII(SUBSTRING(NewString, 3, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- third char
AND ASCII(SUBSTRING(NewString, 4, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- fourth char
FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)') -- use the value clause to handle xml character issues like &,",>,<
,1,2,'') AS CodeList -- strip the leading '; ' (two characters)
FROM CTE c
GROUP BY ID
OPTION (MAXRECURSION 0);
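The ASCII checks are needed because PATINDEX follows the collation of its input, which is case insensitive on many installations. A possible alternative, sketched here as an assumption rather than as part of the answer above, is to apply a binary collation such as Latin1_General_BIN2 to the searched string, so that [A-Z] only matches the capital letters A to Z:

```sql
SELECT s,
       -- Uses whatever collation the value already has (often case insensitive)
       default_collation = PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%', s),
       -- Binary collation: [A-Z] covers only code points 'A' through 'Z'
       binary_collation  = PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',
                                    s COLLATE Latin1_General_BIN2)
FROM (VALUES (';#AaaA;#OPQR'), (';#abc none')) AS t (s);
```

With the binary collation the first sample should match at the start of ;#OPQR rather than at ;#AaaA. A case-sensitive but non-binary collation is not a safe substitute here, because its A-Z range can still include lowercase letters.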
I came to something like this so far:
ALTER FUNCTION CleanData
(
-- Parameters here
@Text AS VARCHAR(4000)
)
RETURNS VARCHAR(4000)
AS
BEGIN
WHILE PATINDEX('%[0-9#;()]%', @Text) > 0
BEGIN
SET @Text = STUFF(@Text, PATINDEX('%[0-9#;()]%', @Text), 1, '')
END
RETURN @Text
END
But what I get is the initials and the characters in parentheses, as PATINDEX cannot distinguish between upper and lower case. Maybe it will be helpful for somebody else.

Resources