Keep only desired characters and separate with semicolon in T-SQL - sql-server

The problem:
I have text data imported into the db with a lot of unwanted characters. I need to keep only 4 capital letter strings within the imported text string. Example:
1447;#MIBD (This is a nice name);#2056;#LKRE (Very nice name indeed)
this could be in one column in one row of my table. What I need to extract from the string is:
MIBD and LKRE
And the result should preferably be the desired strings separated with semicolons.
It should be applied to the whole column and I cannot know how many of these 4 upper case letter strings might appear in one row.
I went through all sorts of functions such as PATINDEX, but I really do not know how to approach it. Thanks for any help!

Try this. It assumes that the four-character code is always preceded by ;#. Because PATINDEX is case-insensitive here (it follows the collation of the input), I have added an extra check to verify that all four characters are capitals.
DECLARE @MyTable TABLE (ID INT, MyString VARCHAR(8000))
INSERT INTO @MyTable
VALUES
(1, '1447;#MIBD (This is a nice name);#2056;#LKRE (Very nice name indeed)')
,(2, ';#DBCC (This is a nice name);#2056;#LLC (Very nice name indeed) ;#ABCD')
,(3, ';#AaaA;#OPQR;1234 (and) ;#WXYZ')
,(4, ';#abc this empty string without any code')
;WITH CTE AS
(
-- anchor member: take the first ;#XXXX candidate and cut the string off right after it
SELECT ID
,SUBSTRING(MyString, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+2, 4) AS NewString
,STUFF(MyString, 1, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+5, '') AS MyString -- +5 stops at the last code character, so an adjacent ;# is kept
FROM @MyTable m
WHERE PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString) > 0
UNION ALL
-- recursive member: repeat on the remainder of the string
SELECT ID
,SUBSTRING(MyString, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+2, 4) AS NewString
,STUFF(MyString, 1, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+5, '') AS MyString
FROM CTE c
WHERE PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString) > 0
)
SELECT c.ID,
STUFF(( SELECT '; ' + NewString
FROM CTE c1
WHERE c1.ID = c.ID
AND ASCII(SUBSTRING(NewString, 1, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- first char
AND ASCII(SUBSTRING(NewString, 2, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- second char
AND ASCII(SUBSTRING(NewString, 3, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- third char
AND ASCII(SUBSTRING(NewString, 4, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- fourth char
FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)') -- use the value clause to handle XML-escaped characters such as &, ", >, <
,1,2,'') AS CodeList -- strip the leading '; '
FROM CTE c
GROUP BY ID
OPTION (MAXRECURSION 0);

I came to something like this so far:
ALTER FUNCTION CleanData
(
-- Parameters here
@Text AS VARCHAR(4000)
)
RETURNS VARCHAR(4000)
AS
BEGIN
WHILE PATINDEX('%[0-9#;()]%', @Text) > 0
BEGIN
SET @Text = STUFF(@Text, PATINDEX('%[0-9#;()]%', @Text), 1, '')
END
RETURN @Text
END
But what I get is the initials plus the text that was in parentheses, because PATINDEX does not distinguish between upper and lower case here. Maybe it is helpful for somebody else.
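Whether PATINDEX distinguishes case depends on the collation of the expression being searched. The following is a minimal sketch, assuming a binary collation such as Latin1_General_BIN is acceptable for the comparison; the variable @s and its value are made up for illustration.

```sql
DECLARE @s VARCHAR(100) = 'nice name;#MIBD';

-- Under a binary collation [A-Z] matches only the code points 'A' to 'Z',
-- so the lowercase words 'nice' and 'name' are not mistaken for a code.
SELECT PATINDEX('%[A-Z][A-Z][A-Z][A-Z]%', @s COLLATE Latin1_General_BIN) AS FirstCodePos; -- 12
```

With a case-insensitive collation the same pattern would already match at 'nice', which is the behaviour described above.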

Related

Regex in SQL Server Replace function

I have a variable with random text, let's say
DECLARE @sNumberFormat NVARCHAR(200) = 'rand{text.here,{999}also-Random9He8re'
I want to replace each 9 in {999} by [0-9]. So in this example I would like to get
'rand{text.here,[0-9][0-9][0-9]also-Random9He8re'
The problem is that I never know how many 9s will be inside the braces, so there can be {99}, {9999}, and so on. I also need to validate the content: if the braces contain any invalid character (anything other than 9), nothing should be replaced.
I have tried some combinations of REPLACE and PATINDEX functions, but I could not achieve that.
Sans robust regex support, SQL Server's native functions do not give much help here. One approach, a bit hackish, would be to separate the input string into three components:
rand{text.here,
{999}
also-Random9He8re
Next, replace the 9 in the middle target substring with #, or some other character which you don't expect to appear anywhere else in your input string:
rand{text.here,
{###}
also-Random9He8re
Finally, replace the # in the middle substring with [0-9] and then concatenate together to get the final result:
DECLARE @val NVARCHAR(200) = 'rand{text.here,{999}also-Random9He8re'
SELECT REPLACE(
SUBSTRING(@val, 1, CHARINDEX('{9', @val) - 1) +
REPLACE(SUBSTRING(@val,
CHARINDEX('{9', @val) + 1,
CHARINDEX('9}', @val) - CHARINDEX('{9', @val)), '9', '#') +
SUBSTRING(@val, CHARINDEX('9}', @val) + 2, LEN(@val) - CHARINDEX('9}', @val)),
'#', '[0-9]');
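The same split can also be sketched with STUFF and REPLICATE, which avoids the placeholder character and adds the validation the question asks for. This is only a sketch under assumptions: it handles the first {9...} group only, and the helper variables @openPos, @closePos and @inner are names introduced here for illustration.

```sql
DECLARE @val NVARCHAR(200) = N'rand{text.here,{999}also-Random9He8re';

-- Position of the opening '{9' and of the first '}' that follows it
DECLARE @openPos INT = CHARINDEX(N'{9', @val);
DECLARE @closePos INT = CASE WHEN @openPos > 0 THEN CHARINDEX(N'}', @val, @openPos) ELSE 0 END;

-- Text between the braces (empty when no complete group was found)
DECLARE @inner NVARCHAR(200) =
    CASE WHEN @closePos > @openPos
         THEN SUBSTRING(@val, @openPos + 1, @closePos - @openPos - 1)
         ELSE N'' END;

SELECT CASE
           -- replace only when the braces contain nothing but 9s
           WHEN @inner <> N'' AND @inner NOT LIKE N'%[^9]%'
           THEN STUFF(@val, @openPos, @closePos - @openPos + 1,
                      REPLICATE(N'[0-9]', LEN(@inner)))
           ELSE @val
       END AS Result;
```

For the sample value this returns rand{text.here,[0-9][0-9][0-9]also-Random9He8re, and a group such as {9x9} leaves the input unchanged.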
So the lazy dev in me suggests this:
SELECT Replace(
Replace(
Replace(
Replace(@input, '{9999}', '[0-9][0-9][0-9][0-9]')
, '{999}', '[0-9][0-9][0-9]')
, '{99}', '[0-9][0-9]')
, '{9}', '[0-9]') AS result
;
You can keep extending as long as you like to perform your (one off?) replacements.
Quick. Simple. Extensible. Hacky.
Sometimes lazy is good enough.
This can be done with a series of CTEs. It works with an arbitrary number of "9" characters inside the curly braces.
Declare @str varchar(max) = 'rand{text.here,{999}also-Random9He8re';
With A As
(Select 1 As Pos
Union All
Select Pos+1 As Pos From A Where Pos < LEN(@str)
),
B As (
Select STRING_AGG(Case When Chr Like '[{9}]' Then Chr Else ' ' End, '') As Chr
From A Cross Apply (Select SUBSTRING(@str,A.Pos,1 )) As T(chr)
),
C As (
Select [value] As pattern,
REPLACE(REPLACE(REPLACE([value], '9', '[0-9]'),'{',''),'}','') As replacement,
ROW_NUMBER() Over (ORDER BY (SELECT NULL)) As Num,
COUNT(*) OVER (ORDER BY (SELECT NULL)) As Cnt
From B Cross Apply STRING_SPLIT(Chr,' ')
Where [value] Like '{%}' And [value] Like '%9%'
),
D As (
Select @str As Result, 1 As Num
Union All
select REPLACE(Result, C.pattern, C.replacement) As Res , D.Num+1 As Num
From D Inner Join C On (D.Num=C.Num)
Where D.Num<=C.Cnt)
Select Top 1 Result
From D
Order by Num Desc
A - Gets a list of character positions in the text
B - Gets the text with spaces in place of every character other than '9', '{', '}'
C - Gets the patterns and their corresponding replacement values
D - Builds the result using the REPLACE function

SQL for Splitting a string by hyphen, and then generating all the suffixes for each chunk that is 3 or more chars in length?

I am working with a Prefix Search engine, and I am trying to generate suffix keywords for my part numbers.
Example String: 123456-7890-A-BCDEF-GHIJ-KL
I am looking to split this string into chunks, like this:
123456
7890
A
BCDEF
GHIJ
KL
Then I need to generate the suffixes of each chunk that is 3 or more chars in length, into one comma-delimited list.
--For chunk 123456, I would get the suffixes 23456, 3456, 456, 56
--For chunk 7890, I would get the suffixes 890, 90
--For chunk A, it would be ignored as it is less than 3 chars in Length
--For chunk BCDEF, I would get the suffixes CDEF, DEF, EF
--For chunk GHIJ, I would get the suffixes HIJ, IJ
--For chunk KL, it would be ignored as it is less than 3 chars in Length
My string could have any amount of chars in each chunk, they are not always formatted like the example.
So the final result for string 123456-7890-A-BCDEF-GHIJ would look like this;
23456, 3456, 456, 56, 890, 90, CDEF, DEF, EF, HIJ, IJ
Some other example strings;
123-4567890-ABC-DEFGHIJ-K-L
--Result: 23, 567890, 67890, 7890, 890, 90, BC, EFGHIJ, FGHIJ, GHIJ, HIJ, IJ
123456-7-890AB-CDEFG-H-IJKL
--Result: 23456, 3456, 456, 56, 90AB, 0AB, AB, DEFG, EFG, FG, JKL, KL
I acknowledge this is not quite what you are looking for, but it is close. Perhaps this will give you or someone else an idea to get exactly what you want. This assumes SQL Server 2017 or higher.
I use STRING_SPLIT() to divide the string into one row per chunk and number those rows with ROW_NUMBER(). There is a problem with this: STRING_SPLIT() does not guarantee order, and the query below numbers the rows by value rather than by position.
I then use a CTE to come up with index values that step through each chunk from 2 to chunk length - 1, putting those strings back into a comma-separated list for each chunk using STRING_AGG().
I insert those results to a temp table so that I can select them in order by row number and assemble the suffixes from each chunk into the final comma-separated list.
DECLARE @MyString VARCHAR(50);
SET @MyString = '123456-7890-A-BCDEF-GHIJ';
WITH cte
AS (SELECT
2 AS n -- anchor member
, value
, ROW_NUMBER() OVER (ORDER BY value) AS [rn]
FROM STRING_SPLIT(@MyString, '-')
WHERE LEN(value) >= 3
UNION ALL
SELECT
n + 1
, cte.value -- recursive member
, cte.rn
FROM cte
WHERE n < (LEN(value) - 1) -- terminator
)
SELECT STRING_AGG(SUBSTRING(value, n, LEN(value) - n + 1), ', ') as [Chunk]
INTO #Temp
FROM cte
GROUP BY rn
ORDER BY rn;
SELECT STRING_AGG(Chunk, ', ')
FROM #Temp
Here is the dbfiddle.
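The ordering concern mentioned above can be sidestepped on newer versions. The following is a sketch only, assuming SQL Server 2022 or Azure SQL, where STRING_SPLIT accepts a third enable_ordinal argument and returns an ordinal column; the CTE names chunks and suffixes are made up for illustration.

```sql
DECLARE @MyString VARCHAR(50) = '123456-7890-A-BCDEF-GHIJ';

WITH chunks AS (
    -- ordinal is the 1-based position of each chunk in the input
    SELECT value, ordinal
    FROM STRING_SPLIT(@MyString, '-', 1)
    WHERE LEN(value) >= 3
),
suffixes AS (
    SELECT value, ordinal, 2 AS n          -- first suffix starts at character 2
    FROM chunks
    UNION ALL
    SELECT value, ordinal, n + 1           -- next, shorter suffix
    FROM suffixes
    WHERE n < LEN(value) - 1               -- stop at the 2-character suffix
)
SELECT STRING_AGG(SUBSTRING(value, n, LEN(value) - n + 1), ', ')
       WITHIN GROUP (ORDER BY ordinal, n) AS Suffixes
FROM suffixes;
```

Ordering by ordinal and then by n keeps the chunks in input order and the suffixes from longest to shortest, without the intermediate temp table.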
There is probably a more elegant way to do this and likely better string manipulators than t-sql, but this seems to work: (requires DB to be compatibility mode = SQL 2016 or above to use STRING_SPLIT)
DECLARE @STRING VARCHAR (100)
DECLARE @DIVIDEON CHAR (1)
DECLARE @FULLSTRING VARCHAR (100)
DECLARE @SUBSTR VARCHAR (100)
DECLARE @WRKSTRING VARCHAR (100)
DECLARE @LEN TINYINT
SELECT @STRING = '123456-7-890AB-CDEFG-H-IJKL'
SELECT @DIVIDEON = '-'
SELECT @FULLSTRING =''
IF OBJECT_ID('tempdb..#mytable') IS NOT NULL
DROP TABLE #mytable
select RIGHT(value,LEN(value)-1) as MyString
into #mytable
from STRING_SPLIT (@STRING, @DIVIDEON) -- only SQL 2016 and above
where len(value)>2 -- ignore subsets that do not have 3 or more chars
while exists (select MyString from #mytable)
begin
select @WRKSTRING = (select top 1 MyString from #mytable)
select @LEN = len(@WRKSTRING)
select @SUBSTR=@WRKSTRING
while @LEN>2
begin
select @SUBSTR=@SUBSTR+', '+RIGHT(@WRKSTRING,@LEN-1)
select @LEN=@LEN-1
end
delete from #mytable where MyString = @WRKSTRING
--select @SUBSTR as Fullstr
select @FULLSTRING=@FULLSTRING+@SUBSTR+','
end
select LEFT(@FULLSTRING,LEN(@FULLSTRING)-1)
drop table #mytable
--not performant...maybe good enough(?)
declare @t table
(
id int identity primary key clustered,
thecol varchar(40)
);
insert into @t(thecol)
values('1-2-3-4-5-6'), ('abcd') /*??*/;
insert into @t(thecol)
select top (10000) newid()
from master.dbo.spt_values as a
cross join master.dbo.spt_values as b;
select *,
thelist= replace(
cast('<!--'+replace(thecol, '-', '--><!--')+'-->' as xml).query('
let $seq := (2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20) (: note: max 22 chars per string-part:)
for $a in comment()
let $str := string($a), $maxlen := string-length($str)-1
for $i in $seq[. <= $maxlen]
return concat(substring($str, $i, 100), "," )
').value('.', 'varchar(max)')+'_', ',_', '')
from @t;

TSQL: How insert separator between each character in a string

I have a string like this:
Apple
I want to include a separator after each character so the end result will turn out like this:
A,p,p,l,e
In C#, there is a one-liner to achieve the above: Regex.Replace("Apple", ".{1}", "$0,");
I can only think of looping each character with charindex to append the separator but seems a little complicated. Is there any elegant way and simpler way to achieve this?
Thanks HABO for the suggestions. I was able to generate the result I want using that code, but it takes a little time to really understand how it works.
After some searching, I managed to find a useful article on inserting empty spaces between each character, which is easier for me to understand.
I modified the code a little to define and use the desired separator instead of hard-coding a space:
DECLARE @pos INT = 2 -- location where we want the first separator
DECLARE @result VARCHAR(100) = 'Apple'
DECLARE @separator nvarchar(5) = ','
WHILE @pos < LEN(@result)+1
BEGIN
SET @result = STUFF(@result, @pos, 0, @separator);
SET @pos = @pos+2;
END
select @result; -- Output: A,p,p,l,e
Reference
In the following SQL scripts, I get each character using the SUBSTRING() function with a numbers table (I used the spt_values view here for simplicity) and then concatenate them via two different methods; you can choose either one.
If you are using SQL Server 2017, there is a new string aggregation function, STRING_AGG.
The first script uses the string_agg function:
declare @str nvarchar(max) = 'Apple'
SELECT
string_agg( substring(@str,number,1) , ',') Within Group (Order By number)
FROM master..spt_values n
WHERE
Type = 'P' and
Number between 1 and len(@str)
If you are working with a previous version, you can concatenate the strings using FOR XML PATH and the STUFF function as follows:
declare @str nvarchar(max) = 'Apple'
; with cte as (
SELECT
number,
substring(@str,number,1) as L
FROM master..spt_values n
WHERE
Type = 'P' and
Number between 1 and len(@str)
)
SELECT
STUFF(
(
SELECT
',' + L
FROM cte
order by number
FOR XML PATH('')
), 1, 1, ''
)
Both solutions yield the same result. I hope it helps.
If you have SQL Server 2017 and a copy of ngrams8k it's ultra simple:
declare @word varchar(100) = 'apple';
select newString = string_agg(token, ',') within group (order by position)
from dbo.ngrams8k(@word,1);
For pre-2017 systems it's almost as simple:
declare @word varchar(100) = 'apple';
select newstring =
( select token + case len(@word)+1-position when 1 then '' else ',' end
from dbo.ngrams8k(@word,1)
order by position
for xml path(''))
One ugly way to do it is to split the string into characters, ideally using a numbers table, and reassemble it with the desired separator.
A less efficient implementation uses recursion in a CTE to split the characters and insert the separator between pairs of characters as it goes:
declare @Sample as VarChar(20) = 'Apple';
declare @Separator as Char = ',';
with Characters as (
select 1 as Position, Substring( @Sample, 1, 1 ) as Character
union all
select Position + 1,
case when Position & 1 = 1 then @Separator else Substring( @Sample, Position / 2 + 1, 1 ) end
from Characters
where Position < 2 * Len( @Sample ) - 1 )
select Stuff( ( select Character + '' from Characters order by Position for XML Path( '' ) ), 1, 0, '' ) as Result;
You can replace the select Stuff... line with select * from Characters; to see what's going on.
Try this
declare @var varchar(50) ='Apple'
;WITH CTE
AS
(
SELECT
SeqNo = 1,
MyStr = @var,
OpStr = CAST('' AS VARCHAR(50))
UNION ALL
SELECT
SeqNo = SeqNo+1,
MyStr = MyStR,
OpStr = CAST(ISNULL(OpStr,'')+SUBSTRING(MyStR,SeqNo,1)+',' AS VARCHAR(50))
FROM CTE
WHERE SeqNo <= LEN(@var)
)
SELECT
OpStr = LEFT(OpStr,LEN(OpStr)-1)
FROM CTE
WHERE SeqNo = LEN(@Var)+1

Issue with patindex and unicode character '-'

I have a string called Dats which has the general appearance xxxx-nnnnn (where x is a character and n is a number) or nnn-nnnnnn.
I want to return only the numbers.
For this I've tried:
SELECT Distinct dats,
Left(SubString(artikelnr, PatIndex('%[0-9.-]%', artikelnr), 8000), PatIndex('%[^0-9.-]%', SubString(artikelnr, PatIndex('%[0-9.-]%', artikelnr), 8000) + 'X')-1)
FROM ThatDatabase
It is almost what I want. It removes the regular characters x, but it does not remove the unicode character -. How can I remove this as well? Also, it seems rather inefficient to call PatIndex twice for every row; is there a way to avoid this? (This will be used on a big database where the result of this query will be used as keys.)
EDIT: Updated as a new database sometimes contained additional -'s or . together with -.
DECLARE @T as table
(
dats nvarchar(100) -- wide enough for the longest sample value
)
INSERT INTO @T VALUES
('111BWA30'),
('115-200-11'),
('115-22.4-1'),
('10.000.22'),
('600F-FFF200')
I wasn't sure if you wanted the numbers before the - char as well, but if you do, here is one way to do it:
Create and populate sample table (Please save us this step in your future questions)
DECLARE @T as table
(
dats nvarchar(10)
)
INSERT INTO @T VALUES
('abcde-1234'),
('23-343')
The query:
SELECT dats,
case when patindex('%[^0-9]-[0-9]%', dats) > 0 then
right(dats, len(dats) - patindex('%-[0-9]%', dats))
else
stuff(dats, charindex('-', dats), 1, '')
end As NumbersOnly
FROM @T
Results:
dats NumbersOnly
abcde-1234 1234
23-343 23343
If you want only the numbers to the right of the - char, it's simpler:
SELECT dats,
right(dats, len(dats) - patindex('%-[0-9]%', dats)) As RightNumbersOnly
FROM @T
Results:
dats RightNumbersOnly
abcde-1234 1234
23-343 343
If you know which characters you need to remove, then use the REPLACE function:
DECLARE @T as table
(
dats nvarchar(100)
)
INSERT INTO @T
VALUES
('111BWA30'),
('115-200-11'),
('115-22.4-1'),
('10.000.22'),
('600F-FFF200')
SELECT REPLACE(REPLACE(dats, '.', ''), '-', '')
FROM @T

How do I extract part of a string in t-sql

If I have the following nvarchar variable - BTA200, how can I extract just the BTA from it?
Also, if I have varying lengths such as BTA50, BTA030, how can I extract just the numeric part?
I would recommend a combination of PatIndex and Left. Carefully constructed, you can write a query that always works, no matter what your data looks like.
Ex:
Declare @Temp Table(Data VarChar(20))
Insert Into @Temp Values('BTA200')
Insert Into @Temp Values('BTA50')
Insert Into @Temp Values('BTA030')
Insert Into @Temp Values('BTA')
Insert Into @Temp Values('123')
Insert Into @Temp Values('X999')
Select Data, Left(Data, PatIndex('%[0-9]%', Data + '1') - 1)
From @Temp
PatIndex will look for the first character that falls in the range 0-9 and return its character position, which you can use with the LEFT function to extract the correct data. Note that PatIndex is actually using Data + '1'. This protects us from data where no numbers are found. If there are no numbers, PatIndex would return 0. In this case, the LEFT function would error because we are using Left(Data, PatIndex - 1). When PatIndex returns 0, we would end up with Left(Data, -1), which returns an error.
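The effect of the appended '1' can be seen in isolation. This is a small sketch using the literal 'BTA' from the sample data; the column aliases are made up for illustration.

```sql
SELECT PATINDEX('%[0-9]%', 'BTA')       AS WithoutSentinel, -- 0: no digit found
       PATINDEX('%[0-9]%', 'BTA' + '1') AS WithSentinel,    -- 4: position of the appended digit
       LEFT('BTA', PATINDEX('%[0-9]%', 'BTA' + '1') - 1) AS AlphaPart; -- BTA
```

Because the sentinel digit always sits one position past the end of the data, LEFT receives the full length of the string whenever the data itself contains no digit.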
There are still ways this can fail. For a full explanation, I encourage you to read:
Extracting numbers with SQL Server
That article shows how to get numbers out of a string. In your case, you want to get alpha characters instead. However, the process is similar enough that you can probably learn something useful out of it.
substring(field, 1,3) will work on your examples.
select substring(field, 1,3) from table
Also, if the alphabetic part is of variable length, you can do this to extract the alphabetic part:
select substring(field, 1, PATINDEX('%[1234567890]%', field) -1)
from table
where PATINDEX('%[1234567890]%', field) > 0
LEFT ('BTA200', 3) will work for the examples you have given, as in :
SELECT LEFT(MyField, 3)
FROM MyTable
To extract the numeric part, you can use this code
SELECT RIGHT(MyField, LEN(MyField) - 3)
FROM MyTable
WHERE MyField LIKE 'BTA%'
--Only have this test if your data does not always start with BTA.
declare @data as varchar(50)
set @data='ciao335'
--get text
Select Left(@Data, PatIndex('%[0-9]%', @Data + '1') - 1) ---->>ciao
--get numeric
Select right(@Data, len(@data) - (PatIndex('%[0-9]%', @Data )-1) ) ---->>335
