TSQL Using SUBSTRING PATINDEX and STUFF to Amend Data - sql-server

TSQL MSSQL 2008r2
I need help to amend data.
I've got so far and now I need help.
Sample Data
[EDIT] Additional examples added
DECLARE @Table TABLE (NodePropertyValue NVARCHAR(50))
INSERT INTO @Table (NodePropertyValue)
VALUES
(N'AA11✏AAA ZZZZ'),
(N'CRAP BB22✏BBB'),
(N'CC55✏CC1'),
(N'DD66✏666'),
(N'EE55✏EEE ES177'),
(N'RUBBISH FF22✏FFF XXXXXX'),
(N'NONSENSE')
I want to show the data like so.
If NCHAR(9999) or pencil exists and the next 3 characters are letters then add a slash (/) after the third character. If any other characters exist after the added slash then delete them. So for [AA11✏AAA ZZZZ] should be updated to [AA11✏AAA/].
If NCHAR(9999) exists and there are characters before the preceding 4 characters then delete them. So for [CRAP BB22✏BBB] should be updated to [BB22✏BBB/]
For [NONSENSE] should be shown as NULL.
This is as far as I have got. As you can see I'm stuck with adding a slash and removing characters not needed.
SELECT
V.NodePropertyValue 'Original'
,CASE --Pencil NCHAR(9999) exists
WHEN PATINDEX('%'+NCHAR(9999)+'%', UPPER(V.NodePropertyValue)) > 0
THEN
CASE
WHEN --FIRST 4 chars match XX11 and 5th char equals NCHAR(9999)
PATINDEX('[A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue)) > 0
AND SUBSTRING(V.NodePropertyValue, PATINDEX('%[A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue))+ 4, 1) = NCHAR(9999)
THEN
STUFF(V.NodePropertyValue, PATINDEX('[A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue))+ 4
, 50
, SUBSTRING(V.NodePropertyValue, PATINDEX('[A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue))+ 4, 50) )
WHEN --Any 4 chars match XX11 and preceding char is space and 5th char equals NCHAR(9999)
PATINDEX('% [A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue)) > 0
AND SUBSTRING(V.NodePropertyValue, PATINDEX('%[A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue))+ 4, 1) = NCHAR(9999)
THEN
STUFF(V.NodePropertyValue, PATINDEX('% [A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue))+ 4
, 50
, SUBSTRING(V.NodePropertyValue, PATINDEX('% [A-Z][A-Z][0-9][0-9]%', UPPER(V.NodePropertyValue))+ 4, 50) )
ELSE
NULL
END
ELSE
NULL
END 'Updated'
FROM
@Table V

Here is a way to get your desired results:
Create and populate sample table (I've added some more sample data based on our conversation in the comments)
DECLARE @Table TABLE (NodePropertyValue NVARCHAR(50))
INSERT INTO @Table (NodePropertyValue)
VALUES
(N'AA11✏AAA ZZZZ'),
(N'CRAP BB22✏BBB'),
(N'EE55✏EEE ES177'),
(N'RUBBISH FF22✏FFF XXXXXX'),
(N'AA✏AAA ZZZZ'),
(N'AA✏A2A ZZZZ'),
(N'AA✏A'),
(N'NONSENSE')
A cte to calculate the start and end of the desired pattern
;WITH CTE AS
(
SELECT NodePropertyValue,
-- note: there are 4 underscores before the pencil
PATINDEX('%____'+ NCHAR(9999) +'[a-z][a-z][a-z]%', NodePropertyValue) As startPattern,
CHARINDEX(NCHAR(9999), NodePropertyValue) + 3 As EndPattern
FROM @Table
)
query the cte:
SELECT NodePropertyValue,
CASE WHEN startPattern > 0 THEN
SUBSTRING(NodePropertyValue, startPattern, EndPattern-startPattern+1) + '/'
ELSE
NULL
END As Updated
FROM CTE
Result:
NodePropertyValue Updated
AA11✏AAA ZZZZ AA11✏AAA/
CRAP BB22✏BBB BB22✏BBB/
EE55✏EEE ES177 EE55✏EEE/
RUBBISH FF22✏FFF XXXXXX FF22✏FFF/
AA✏AAA ZZZZ NULL
AA✏A2A ZZZZ NULL
AA✏A NULL
NONSENSE NULL
See a live demo on rextester.
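If the four characters before the pencil must be two letters followed by two digits, as in the question's own attempt, the underscore wildcards can be swapped for character classes. This is only a minimal sketch: it assumes a case-insensitive collation, and @Sample and the column alias StartPos are hypothetical names used for illustration. The fixed length of 8 covers the four leading characters, the pencil, and the three letters.

```sql
-- Hypothetical sample data; NCHAR(9999) is the pencil character
DECLARE @Sample TABLE (NodePropertyValue NVARCHAR(50));
INSERT INTO @Sample (NodePropertyValue)
VALUES (N'AA11✏AAA ZZZZ'), (N'CRAP BB22✏BBB'), (N'CC55✏CC1'), (N'NONSENSE');

SELECT s.NodePropertyValue,
       -- CASE without ELSE yields NULL when the pattern is not found
       CASE WHEN p.StartPos > 0
            THEN SUBSTRING(s.NodePropertyValue, p.StartPos, 8) + N'/'
       END AS Updated
FROM @Sample AS s
CROSS APPLY (VALUES (PATINDEX(N'%[A-Z][A-Z][0-9][0-9]' + NCHAR(9999) + N'[A-Z][A-Z][A-Z]%',
                              s.NodePropertyValue))) AS p(StartPos);
```

With these rows the first two should come back as AA11✏AAA/ and BB22✏BBB/, and the other two as NULL.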

If there are always letters after the pencil and no numbers, does this suffice?
select case when patindex('%' + nchar(9999) + '%' , NodePropertyValue)=0 then null
else substring( NodePropertyValue, patindex('%' + nchar(9999) + '%', NodePropertyValue)-4, 8) + '/'
end as StringStart
from @Table

Related

SQL for Splitting a string by hyphen, and then generating all the suffixes for each chunk that is 3 or more chars in length?

I am working with a Prefix Search engine, and I am trying to generate suffix keywords for my part numbers.
Example String: 123456-7890-A-BCDEF-GHIJ-KL
I am looking to split this string into chunks, like this:
123456
7890
A
BCDEF
GHIJ
KL
Then I need to generate the suffixes of each chunk that is 3 or more chars in length, into one comma-delimited list.
--For chunk 123456, I would get the suffixes 23456, 3456, 456, 56
--For chunk 7890, I would get the suffixes 890, 90
--For chunk A, it would be ignored as it is less than 3 chars in Length
--For chunk BCDEF, I would get the suffixes CDEF, DEF, EF
--For chunk GHIJ, I would get the suffixes HIJ, IJ
--For chunk KL, it would be ignored as it is less than 3 chars in Length
My string could have any amount of chars in each chunk, they are not always formatted like the example.
So the final result for string 123456-7890-A-BCDEF-GHIJ would look like this;
23456, 3456, 456, 56, 890, 90, CDEF, DEF, EF, HIJ, IJ
Some other example strings;
123-4567890-ABC-DEFGHIJ-K-L
--Result: 23, 567890, 67890, 7890, 890, 90, BC, EFGHIJ, FGHIJ, GHIJ, HIJ, IJ
123456-7-890AB-CDEFG-H-IJKL
--Result: 23456, 3456, 456, 56, 90AB, 0AB, AB, DEFG, EFG, FG, JKL, KL
I acknowledge this is not quite what you are looking for, but it is close. Perhaps this will give you or someone else an idea to get exactly what you want. This assumes SQL Server 2017 or higher.
So I am using STRING_SPLIT() to divide the string into one row for each chunk numbering those rows using ROW_NUMBER(). There is a problem with this in that STRING_SPLIT() does not guarantee order.
I then use a CTE to basically come up with index values to step through each chunk from 2 to chunk length - 1, putting those strings back into a comma-separated list for each chunk using STRING_AGG().
I insert those results to a temp table so that I can select them in order by row number and assemble the suffixes from each chunk into the final comma-separated list.
DECLARE @MyString VARCHAR(50);
SET @MyString = '123456-7890-A-BCDEF-GHIJ';
WITH cte
AS (SELECT
2 AS n -- anchor member
, value
, ROW_NUMBER() OVER (ORDER BY value) AS [rn]
FROM STRING_SPLIT(@MyString, '-')
WHERE LEN(value) >= 3
UNION ALL
SELECT
n + 1
, cte.value -- recursive member
, cte.rn
FROM cte
WHERE n < (LEN(value) - 1) -- terminator
)
SELECT STRING_AGG(SUBSTRING(value, n, LEN(value) - n + 1), ', ') as [Chunk]
INTO #Temp
FROM cte
GROUP BY rn
ORDER BY rn;
SELECT STRING_AGG(Chunk, ', ')
FROM #Temp
Here is the dbfiddle.
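One way around the ordering caveat mentioned above is to skip STRING_SPLIT() entirely and walk the character positions of the original string, so every suffix keeps its position for sorting. The following is a sketch only, assuming SQL Server 2017 or later for STRING_AGG ... WITHIN GROUP; the CTE names Numbers and Suffixes are made up for illustration, and sys.all_objects is used merely as a convenient row source with more rows than the string has characters.

```sql
DECLARE @MyString VARCHAR(50) = '123456-7890-A-BCDEF-GHIJ';

WITH Numbers AS (
    -- one row per character position in the string
    SELECT TOP (LEN(@MyString)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects
),
Suffixes AS (
    -- text from position n up to (not including) the next hyphen
    SELECT n,
           SUBSTRING(@MyString, n, CHARINDEX('-', @MyString + '-', n) - n) AS Suffix
    FROM Numbers
    WHERE n > 1
      AND SUBSTRING(@MyString, n, 1) <> '-'      -- not on a hyphen
      AND SUBSTRING(@MyString, n - 1, 1) <> '-'  -- not the first char of a chunk
)
SELECT STRING_AGG(Suffix, ', ') WITHIN GROUP (ORDER BY n) AS SuffixList
FROM Suffixes
WHERE LEN(Suffix) >= 2;  -- drop single-character suffixes
```

For the sample string this should return 23456, 3456, 456, 56, 890, 90, CDEF, DEF, EF, HIJ, IJ. Because a suffix never starts at the first character of a chunk and must be at least 2 characters long, chunks shorter than 3 characters drop out without a separate length check.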
There is probably a more elegant way to do this and likely better string manipulators than t-sql, but this seems to work: (requires DB to be compatibility mode = SQL 2016 or above to use STRING_SPLIT)
DECLARE @STRING VARCHAR (100)
DECLARE @DIVIDEON CHAR (1)
DECLARE @FULLSTRING VARCHAR (100)
DECLARE @SUBSTR VARCHAR (100)
DECLARE @WRKSTRING VARCHAR (100)
DECLARE @LEN TINYINT
SELECT @STRING = '123456-7-890AB-CDEFG-H-IJKL'
SELECT @DIVIDEON = '-'
SELECT @FULLSTRING =''
IF OBJECT_ID('tempdb..#mytable') IS NOT NULL
DROP TABLE #mytable
select RIGHT(value,LEN(value)-1) as MyString
into #mytable
from STRING_SPLIT (@STRING, @DIVIDEON) -- only SQL 2016 and above
where len(value)>2 -- ignore subsets that do not have 3 or more chars
while exists (select MyString from #mytable)
begin
select @WRKSTRING = (select top 1 MyString from #mytable)
select @LEN = len(@WRKSTRING)
select @SUBSTR=@WRKSTRING
while @LEN>2
begin
select @SUBSTR=@SUBSTR+', '+RIGHT(@WRKSTRING,@LEN-1)
select @LEN=@LEN-1
end
delete from #mytable where MyString = @WRKSTRING
--select @SUBSTR as Fullstr
select @FULLSTRING=@FULLSTRING+@SUBSTR+','
end
select LEFT(@FULLSTRING,LEN(@FULLSTRING)-1)
drop table #mytable
--not performant...maybe good enough(?)
declare @t table
(
id int identity primary key clustered,
thecol varchar(40)
);
insert into @t(thecol)
values('1-2-3-4-5-6'), ('abcd') /*??*/;
insert into @t(thecol)
select top (10000) newid()
from master.dbo.spt_values as a
cross join master.dbo.spt_values as b;
select *,
thelist= replace(
cast('<!--'+replace(thecol, '-', '--><!--')+'-->' as xml).query('
let $seq := (2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20) (: note: max 22 chars per string-part:)
for $a in comment()
let $str := string($a), $maxlen := string-length($str)-1
for $i in $seq[. <= $maxlen]
return concat(substring($str, $i, 100), "," )
').value('.', 'varchar(max)')+'_', ',_', '')
from @t;

SQL string before and after certain characters

SELECT NAME
FROM SERVERS
returns:
SDACR.hello.com
SDACR
SDACR\AIR
SDACR.hello.com\WATER
I need the SELECT query for below result:
SDACR
SDACR
SDACR\AIR
SDACR\WATER
Kindly help! I tried using the LEFT and RIGHT functions as below, but I am not able to get the combined output correctly:
SELECT
LEFT(Name, CHARINDEX('.', Name) - 1)
FROM
SERVERS
SELECT
RIGHT(Name, LEN(Name) - CHARINDEX('\', Name))
FROM
SERVERS
It looks like you're just trying to REPLACE a substring of characters in your column. You should try this:
SELECT REPLACE(Name,'.hello.com','') AS ReplacementName
FROM SERVERS
In tsql, you can concatenate values with CONCAT(), or you can simply add strings together with +.
SELECT LEFT(Name, CHARINDEX('.',Name)-1) + RIGHT(Name,LEN(Name)-CHARINDEX('\',Name)) from SERVERS
Also, be careful when doing arithmetic with CHARINDEX(). For a value without a '.' or a '\' it returns 0, so LEFT(Name, CHARINDEX('.', Name) - 1) is called with a length of -1 and raises an error.
You can use LEFT for this to select everything up to the first period (dot) and add on everything after the last \
declare @servers table ([NAME] varchar(64))
insert into @servers
values
('SDACR.hello.com '),
('SDACR'),
('SDACR\AIR'),
('SDACR.hello.com\WATER')
select
left([NAME],case when charindex('.',[NAME]) = 0 then len([NAME]) else charindex('.',[NAME]) -1 end) +
case when charindex('\',left([NAME],case when charindex('.',[NAME]) = 0 then len([NAME]) else charindex('.',[NAME]) -1 end)) = 0 then right([NAME],charindex('\',reverse([NAME]))) else '' end
from @servers
Throwing my hat in.... Showing how to use Values and APPLY for cleaner code.
-- sample data in an easily consumable format
declare @yourdata table (txt varchar(100));
insert @yourdata values
('SDACR.hello.com'),
('SDACR'),
('SDACR\AIR'),
('SDACR.hello.com\WATER');
-- solution
select
txt,
newTxt =
case
when loc.dot = 0 then txt
when loc.dot > 0 and loc.slash = 0 then substring(txt, 1, loc.dot-1)
else substring(txt, 1, loc.dot-1) + substring(txt, loc.slash, 100)
end
from @yourdata
cross apply (values (charindex('.',txt), (charindex('\',txt)))) loc(dot,slash);
Results
txt newTxt
------------------------------ --------------------
SDACR.hello.com SDACR
SDACR SDACR
SDACR\AIR SDACR\AIR
SDACR.hello.com\WATER SDACR\WATER
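Since the part to drop is everything from the first dot up to (but not including) the backslash, STUFF can also cut it out in one step. This is a rough sketch rather than a drop-in answer: @Servers and the aliases Dot and Slash are illustrative names, and it assumes that any dot appears before the backslash. Appending a backslash inside CHARINDEX guarantees a non-zero position for names that have none.

```sql
DECLARE @Servers TABLE (Name VARCHAR(64));
INSERT INTO @Servers (Name)
VALUES ('SDACR.hello.com'), ('SDACR'), ('SDACR\AIR'), ('SDACR.hello.com\WATER');

SELECT Name,
       CASE WHEN p.Dot = 0 THEN Name                     -- no domain part to remove
            ELSE STUFF(Name, p.Dot, p.Slash - p.Dot, '') -- delete from the dot up to the backslash
       END AS ShortName
FROM @Servers
CROSS APPLY (VALUES (CHARINDEX('.', Name),
                     CHARINDEX('\', Name + '\'))) AS p(Dot, Slash);
```

If a dot could follow the backslash, p.Slash - p.Dot becomes negative and STUFF returns NULL, so that case would need its own branch.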

SQL String: Counting Words inside a String

I searched through many of the questions here, but all the ones I found with decent answers are for other languages such as JavaScript.
I have a simple task in SQL that I can't seem to find a simple way to do.
I just need to count the number of "words" inside a SQL string (a sentence). You can see why "words" is in quotes in my examples. The "words" are delimited by white space.
Sample sentences:
1. I am not your father.
2. Where are your brother,sister,mother?
3. Where are your brother, sister and mother?
4. Who are you?
Desired answer:
1. 5
2. 4
3. 7
4. 3
As you can see, I need to count the "words" disregarding the symbols (I have to treat them as part of the word). So in sample no. 2:
(1)Where (2)are (3)your (4)brother,sister,mother? = 4
I can handle the multiple whitespaces by doing a replace like this:
REPLACE(string, '  ', ' ') -> 2 whitespaces to 1
REPLACE(string, '   ', ' ') -> 3 whitespaces to 1 and so on..
What SQL function can I use to do this? I use SQL Server 2012 but need a function that works in SQL Server 2008 as well.
Here is one way to do it:
Create and populate sample table (Please save us this step in your future questions)
DECLARE @T AS TABLE
(
id int identity(1,1),
string varchar(100)
)
INSERT INTO @T VALUES
('I am not your father.'),
('Where are your brother,sister,mother?'),
('Where are your brother, sister and mother?'),
('Who are you?')
Use a cte to replace multiple spaces to a single space (Thanks to Gordon Linoff's answer here)
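;WITH CTE AS
(
SELECT Id,
REPLACE(REPLACE(REPLACE(string, ' ', '><' -- every single space becomes '><'
), '<>', '' -- the '<>' pairs formed between consecutive spaces are removed
), '><', ' ' -- each remaining '><' becomes one space
) as string
FROM @T
)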
Query the CTE - length of the string - length of the string without spaces + 1:
SELECT id, LEN(string) - LEN(REPLACE(string, ' ', '')) + 1 as CountWords
FROM CTE
Results:
id CountWords
1 5
2 4
3 7
4 3
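The same idea can be written as a single expression that also trims leading and trailing spaces and returns 0 for an empty string. Treat this as a sketch under assumptions: the text must not already contain the characters '<' or '>', and the alias Cleaned is just an illustrative name. It uses only functions available in SQL Server 2008.

```sql
DECLARE @T TABLE (id INT IDENTITY(1,1), string VARCHAR(100));
INSERT INTO @T (string)
VALUES ('I am not your father.'),
       ('Where   are your brother,sister,mother?'),  -- run of three spaces
       ('  Who are  you? '),                          -- leading, trailing and double spaces
       ('');

SELECT id,
       CASE WHEN c.Cleaned = '' THEN 0
            -- spaces between words + 1 = number of words
            ELSE LEN(c.Cleaned) - LEN(REPLACE(c.Cleaned, ' ', '')) + 1
       END AS CountWords
FROM @T
CROSS APPLY (VALUES (LTRIM(RTRIM(
    -- collapse every run of spaces into a single space
    REPLACE(REPLACE(REPLACE(string, ' ', '><'), '<>', ''), '><', ' ')
))) ) AS c(Cleaned);
```

For these four rows the counts should be 5, 4, 3 and 0.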
This is a minor improvement of @ZoharPeled's answer. This can also handle 0 length values:
DECLARE @t AS TABLE(id int identity(1,1), string varchar(100))
INSERT INTO @t VALUES
('I am not your father.'),
('Where are your brother,sister,mother?'),
('Where are your brother, sister and mother?'),
('Who are you?'),
('')
;WITH CTE AS
(
SELECT
Id,
REPLACE(REPLACE(string,' ', '><'), '<>', '') string
FROM @t
)
SELECT
id,
LEN(' '+string)-LEN(REPLACE(string, '><', ' ')) CountWords
FROM CTE
To handle multiple spaces too, use the method shown here
Declare @s varchar(100)
set @s='Who are you?'
set @s=ltrim(rtrim(@s))
while charindex('  ',@s)>0 -- two spaces: loop while a double space remains
Begin
set @s=replace(@s,'  ',' ') -- collapse two spaces into one
end
select len(@s)-len(replace(@s,' ',''))+1 as word_count
https://exploresql.com/2018/07/31/how-to-count-number-of-words-in-a-sentence/
I found this query more useful than the first. It omits extra characters, numbers, and symbols, so it counts only the words within a passage...
drop table if exists #t
create table #t (id int identity(1,1), c1 varchar(2000))
insert into #t (c1)
values
('Alireza Sattarzadeh Farkoush '),
('yes it is the   best .'),
('abc def ghja a the . asw'),
('?>< 123 ...!  z a b'),
('Wallex is   the greatest exchange in the .. world a after binance ...!')
select c1 , Count(*)
from (
select id, c1, value  
from #t t
cross apply (
select rtrim(ltrim(value)) as value from string_split(c1,' ')) a
where len(value) > 1 and value like '%[a-Z]%'
) Final
group by c1

How to return list of duplicate words and the count of instances in a table

I basically have a table with a column. Let's call the column 'Summary'.
So if 'Summary' looks like this. I went to the park to find a dog. The dog was not there. I left because there was no dog.
I want to be able to return a list that basically gives me the duplicate words and the hit count of how many times it appeared. I won't know which word exactly is a duplicate so I cannot hard code it into the SQL query.
I need the results to be "Dog" -3, "The"- 2, "I"- 2
I can't post images, so I cannot post a table.
This is not necessarily a very efficient way of achieving the result you are looking for, but this will output a list of words that have a count of 2 or more in the specified summary:
DECLARE @summary NVARCHAR(MAX)
SET @summary = N'I went to the park to find a dog. The dog was not there. I left because there was no dog.'
SET NOCOUNT ON
DECLARE @PosA INT
DECLARE @Word NVARCHAR(MAX)
-- A temporary table to hold matches
CREATE TABLE dbo.#WordList
(
Word NVARCHAR(MAX),
WordCount INT
)
SET @PosA = 0
WHILE (LEN(@summary) > 0)
BEGIN
-- Find the position of the word end
SET @PosA = CHARINDEX(' ', @summary)
IF (@PosA = 0)
SET @PosA = LEN(@summary) + 1
-- Extract the word and shorten the summary text
SET @Word = SUBSTRING(@summary, 0, @PosA)
IF (@PosA < LEN(@summary))
SET @summary = SUBSTRING(@summary, @PosA + 1, LEN(@summary) - @PosA)
ELSE
SET @summary = ''
-- Strip punctuation
SET @Word = REPLACE(REPLACE(@Word, '.', ''), ',', '')
-- Add or create the word
IF EXISTS ( SELECT TOP 1 1 FROM dbo.#WordList WHERE Word = @Word)
UPDATE dbo.#WordList
SET WordCount = WordCount + 1
WHERE (Word = @Word)
ELSE
INSERT INTO dbo.#WordList (Word, WordCount)
VALUES (@Word, 1)
END
-- Get results
SELECT *
FROM dbo.#WordList
WHERE (WordCount > 1)
ORDER BY Word
--- Tidy up
DROP TABLE dbo.#WordList
Effectively, split the summary text by each space and then remove punctuation from the resulting word. The resulting words are stored in the #WordList temporary table, with the count incremented as appropriate.
Finally the results are returned at the end.
Note that you may wish to improve the punctuation removal as I only added full-stops and commas for the purposes of this answer.
I think that for each row, you'll need to split the summary column into separate rows. Then you can do select on that result set, counting each value. Here's a link to a bunch of nice Split functions:
Split functions
They are pretty old, but still very effective. I think something like this TVF should get you going:
CREATE FUNCTION dbo.Split (@sep char(1), @s varchar(512))
RETURNS table
AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @s)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
FROM Pieces
WHERE stop > 0
)
SELECT pn,
SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 512 END) AS s
FROM Pieces
)
DECLARE @summaries TABLE (id int, summary nvarchar(max))
INSERT @summaries values
(1,N'I went to the park to find a dog. The dog was not there. I left because there was no dog.')
SELECT id, word, COUNT(*) c
FROM @summaries
CROSS APPLY (SELECT CAST('<a>'+REPLACE(summary,' ','</a><a>')+'</a>' AS xml) xml1 ) t1
CROSS APPLY (SELECT n.value('.','varchar(max)') AS word FROM xml1.nodes('a') x(n) ) t2
GROUP BY id, word
HAVING COUNT(*) > 1
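On SQL Server 2016 or later (compatibility level 130 and up), the splitting step can be handed to STRING_SPLIT instead of a loop or XML. A minimal sketch, assuming only full stops and commas need stripping and that the column collation is case-insensitive so 'The' and 'the' are counted together; the alias Word is illustrative.

```sql
DECLARE @summary NVARCHAR(MAX) =
    N'I went to the park to find a dog. The dog was not there. I left because there was no dog.';

SELECT w.Word, COUNT(*) AS WordCount
FROM STRING_SPLIT(REPLACE(REPLACE(@summary, N'.', N''), N',', N''), N' ') AS s  -- strip punctuation, split on spaces
CROSS APPLY (VALUES (LTRIM(RTRIM(s.value)))) AS w(Word)
WHERE w.Word <> N''        -- ignore empty pieces produced by double spaces
GROUP BY w.Word
HAVING COUNT(*) > 1        -- keep only words that occur more than once
ORDER BY WordCount DESC, w.Word;
```

To run this against a table, the same STRING_SPLIT call can be placed in a CROSS APPLY over the Summary column and the row id added to the GROUP BY.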

Inserting line breaks in column after or before specific character and length

Here is another weird data column. This table contains long strings of text. I have been trying to make the column readable and managed to strip most of the rubbish, such as creating a replace function and inserting a colon between specific character but it does not really add to the readability.
Is it possible to break (new line) the string shown below after the last number? I don't know how to do this but it's just an idea. Like say insert line break after X Character of the first Zero, since the numbers starts with zero and are the same length?
This is one of those archive data I stumbled upon.
Here is an example of the table: Note the bottom portion is for dramatization only.
Thanks!
CREATE FUNCTION dbo.Split (@sep char(1), @s varchar(512))
RETURNS table
AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @s)+6
UNION ALL
SELECT pn + 1, stop + 1,
CASE
WHEN (CHARINDEX(@sep, @s, stop + 1)+6)>LEN(@s)
THEN -1
ELSE CHARINDEX(@sep, @s, stop + 1)+6
END
FROM Pieces
WHERE stop > 0
)
SELECT pn,
SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 512 END) AS s
FROM Pieces
)
Which can be queried as
SELECT * FROM dbo.Split ('0', 'Titanic 012345 Casino Royal 012346 Terminator 098987')
Please note that 512 can be changed if you need longer string to be processed.
Edit: Solution also assumes there is no '0' in the movie title itself.
Idea taken from Cade Roux who provided link to original solution
Based on the excellent approach by @PavelNefyodov, here is an alternative approach using PATINDEX and CROSS APPLY.
This one requires that you know in advance how many digits you are going to have. Fiddle
CREATE FUNCTION dbo.PatindexSplit (@PatternSearch varchar(100), @PatternLen SMALLINT, @s varchar(max))
RETURNS table
AS
RETURN (
WITH Pieces(pn, Piece, rest) AS (
SELECT 1, LEFT(@s, PATINDEX(@PatternSearch, @s)+@PatternLen), SUBSTRING(@s, PATINDEX(@PatternSearch, @s)+@PatternLen+1,LEN(@s))
UNION ALL
SELECT pn + 1,
CASE WHEN PATINDEX(@PatternSearch, rest) > 0
THEN LEFT(rest, PATINDEX(@PatternSearch, rest)+@PatternLen)
ELSE rest
END,
CASE WHEN PATINDEX(@PatternSearch, rest) > 0
THEN SUBSTRING(rest, PATINDEX(@PatternSearch, rest)+@PatternLen+1,LEN(rest))
ELSE ''
END
FROM Pieces
WHERE LEN(rest) > 0
)
SELECT pn,Piece, rest
FROM Pieces
)
go
SELECT userID, Piece FROM Rentals
CROSS APPLY dbo.PatindexSplit ('%0[0-9][0-9][0-9][0-9][0-9] %', 6, MovieTitle)
SELECT * FROM dbo.PatindexSplit ('%0[0-9][0-9][0-9][0-9][0-9] %', 6, 'Titanic 012345 Casino Royal 012346 Terminator 098987')
Anyway, +1 for doing it in the application layer...
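If the goal is a single string with actual line breaks rather than one row per piece, a plain loop can insert CHAR(13) + CHAR(10) after each number. This is only a sketch, built on the same assumptions as the answers above: every number starts with 0, is exactly six digits long, and is followed by a space unless it ends the string. The variable names are illustrative.

```sql
DECLARE @s VARCHAR(MAX) = 'Titanic 012345 Casino Royal 012346 Terminator 098987';
DECLARE @pattern VARCHAR(100) = '%0[0-9][0-9][0-9][0-9][0-9] %';
DECLARE @result VARCHAR(MAX) = '';
DECLARE @pos INT = PATINDEX(@pattern, @s);

WHILE @pos > 0
BEGIN
    -- keep the text through the end of the 6-digit number, then add a line break
    SET @result = @result + LEFT(@s, @pos + 5) + CHAR(13) + CHAR(10);
    -- continue after the space that followed the number
    SET @s = SUBSTRING(@s, @pos + 7, LEN(@s));
    SET @pos = PATINDEX(@pattern, @s);
END;

-- whatever is left (the last title and number) has no trailing space to match
SET @result = @result + @s;
SELECT @result AS WithLineBreaks;
```

In SQL Server Management Studio the breaks only show up in "Results to Text" mode or when the value is printed; the grid view displays the value on one line.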
