complex SQL string parsing


I have the following text field in a SQL Server table:
1!1,3!0,23!0,288!0,340!0,521!0,24!0,38!0,26!0,27!0,281!0,19!0,470!0,568!0,601!0,2!1,251!0,7!2,140!0,285!0,11!2,33!0
1. I would like to retrieve only the part before the exclamation mark (!). So for 1!1 I only want 1, for 3!0 I only want 3, for 23!0 I only want 23.
2. I would also like to retrieve only the part after the exclamation mark (!). So for 1!1 I only want 1, for 3!0 I only want 0, for 23!0 I only want 0.
The results of points 1 and 2 should be inserted into separate columns of a SQL Server table.

I LOVE SQL Server's XML capabilities. They are a great way to parse data. Try this one out:
--Load the original string
DECLARE @string nvarchar(max) = '1!2,3!4,5!6,7!8,9!10';
--Turn it into XML
SET @string = REPLACE(@string,',','</SecondNumber></Pair><Pair><FirstNumber>') + '</SecondNumber></Pair>';
SET @string = '<Pair><FirstNumber>' + REPLACE(@string,'!','</FirstNumber><SecondNumber>');
--Show the new version of the string
SELECT @string AS XmlIfiedString;
--Load it into an XML variable
DECLARE @xml XML = @string;
--Now, First and Second Number from each pair...
SELECT
Pairs.Pair.value('FirstNumber[1]','nvarchar(1024)') AS FirstNumber,
Pairs.Pair.value('SecondNumber[1]','nvarchar(1024)') AS SecondNumber
FROM @xml.nodes('//*:Pair') Pairs(Pair);
The above query turned the string into XML like this:
<Pair><FirstNumber>1</FirstNumber><SecondNumber>2</SecondNumber></Pair> ...
Then parsed it to return a result like:
FirstNumber | SecondNumber
----------- | ------------
1 | 2
3 | 4
5 | 6
7 | 8
9 | 10
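The question also asks for the two parts to be inserted into separate columns. A minimal sketch of that last step follows, assuming a hypothetical table variable @Pairs with int columns (the real target table and its column types are not given in the question):

```sql
-- Hypothetical target; the table name and int column types are assumptions.
DECLARE @Pairs TABLE (FirstNumber int, SecondNumber int);

DECLARE @string nvarchar(max) = N'1!2,3!4,5!6,7!8,9!10';

-- Same XML transformation as above, written as a single expression.
DECLARE @xml xml =
    N'<Pair><FirstNumber>'
    + REPLACE(REPLACE(@string, N'!', N'</FirstNumber><SecondNumber>'),
              N',', N'</SecondNumber></Pair><Pair><FirstNumber>')
    + N'</SecondNumber></Pair>';

-- Shred the XML and insert both parts of each pair in one statement.
INSERT INTO @Pairs (FirstNumber, SecondNumber)
SELECT Pairs.Pair.value('(FirstNumber/text())[1]',  'int'),
       Pairs.Pair.value('(SecondNumber/text())[1]', 'int')
FROM @xml.nodes('/Pair') AS Pairs(Pair);

SELECT FirstNumber, SecondNumber FROM @Pairs;
```

Note that this only works as long as the source string contains no characters that are special in XML, such as & or <.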

I completely agree with the guys complaining about this sort of data.
The fact, however, is that we often don't have any control over the format of our sources.
Here's my approach...
First you need a tokeniser. This one is very efficient (probably the fastest non-CLR). Found at http://www.sqlservercentral.com/articles/Tally+Table/72993/
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
GO
Then you consume it like so...
DECLARE @Wtf VARCHAR(1000) = '1!1,3!0,23!0,288!0,340!0,521!0,24!0,38!0,26!0,27!0,281!0,19!0,470!0,568!0,601!0,2!1,251!0,7!2,140!0,285!0,11!2,33!0'
SELECT LEFT(Item, CHARINDEX('!', Item)-1)
,RIGHT(Item, CHARINDEX('!', REVERSE(Item))-1)
FROM [dbo].[DelimitedSplit8K](@Wtf, ',')
The function posted and the parsing logic can of course be integrated into a single function.
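As a rough sketch of that integration, the splitter can be wrapped in a second inline table-valued function. The name dbo.SplitPairs8K and its parameter names are made up for illustration, and the sketch assumes dbo.DelimitedSplit8K already exists as defined above:

```sql
-- Hypothetical wrapper; assumes dbo.DelimitedSplit8K exists as defined above.
CREATE FUNCTION dbo.SplitPairs8K
(
    @pString        VARCHAR(8000),
    @pPairDelimiter CHAR(1),   -- separates the pairs, e.g. ','
    @pPartDelimiter CHAR(1)    -- separates the two parts of a pair, e.g. '!'
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
    SELECT s.ItemNumber,
           -- NULLIF keeps LEFT/SUBSTRING from receiving a negative length
           -- when an item contains no part delimiter.
           LeftPart  = LEFT(s.Item, NULLIF(p.Pos, 0) - 1),
           RightPart = SUBSTRING(s.Item, NULLIF(p.Pos, 0) + 1, 8000)
    FROM dbo.DelimitedSplit8K(@pString, @pPairDelimiter) AS s
    CROSS APPLY (SELECT Pos = CHARINDEX(@pPartDelimiter, s.Item)) AS p
    WHERE p.Pos > 0;           -- skip items without a part delimiter
GO

SELECT ItemNumber, LeftPart, RightPart
FROM dbo.SplitPairs8K('1!1,3!0,23!0', ',', '!');
```

Unlike the RIGHT/REVERSE expression above, this sketch splits each item at the first part delimiter rather than the last; for well-formed pairs with a single '!' the two behave the same.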

I agree that normalizing the data is the best way. However, here is an XML solution to parse the data:
DECLARE @str VARCHAR(1000) = '1!1,3!0,23!0,288!0,340!0,521!0,24!0,38!0,26!0,27!0,281!0,19!0,470!0,568!0,601!0,2!1,251!0,7!2,140!0,285!0,11!2,33!0'
,@xml XML
SET @xml = CAST('<row><col>' + REPLACE(REPLACE(@str,'!','</col><col>'),',','</col></row><row><col>') + '</col></row>' AS XML)
SELECT
line.col.value('col[1]', 'varchar(1000)') AS col1
,line.col.value('col[2]', 'varchar(1000)') AS col2
FROM @xml.nodes('/row') AS line(col)

Related

how to generate individual rows for each character between two delimiters in a string

I have a data set with square brackets.
CREATE TABLE Testdata
(
SomeID INT,
String VARCHAR(MAX)
)
INSERT Testdata SELECT 1, 'S0000X-T859XX[DEFGH]'
INSERT Testdata SELECT 1, 'T880XX-T889XX[DS]'
INSERT Testdata SELECT 2, 'V0001X-Y048XX[DS]'
INSERT Testdata SELECT 2, 'Y0801X-Y0889X[AB]'
I need to get output like below:
SomeId String
1 S0000XD-T859XXD
1 S0000XE-T859XXE
1 S0000XF-T859XXF
1 S0000XG-T859XXG
1 S0000XH-T859XXH
1 T880XXD-T889XXD
1 T880XXS-T889XXS
2 V0001XD-Y048XXD
2 V0001XS-Y048XXS
2 Y0801XA-Y0889XA
2 Y0801XB-Y0889XB
I would appreciate it if anyone can help with this.

You don't need a function here, and certainly no need for loops. A tally table will make short work of this. First you need a tally table. I keep one as a view on my system. It is nasty fast!!!
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
GO
You don't have to use a view like this; you could just use the CTEs directly in your code.
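As a small sketch of the inline variant, the tally CTEs can sit in the same WITH clause as the query that consumes them. The Src rows below are two sample values copied from the question, and 100 tally rows are assumed to be enough for the short bracket lists shown:

```sql
WITH
    E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) dt(n)),
    E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),   -- 100 rows
    Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E2),
    -- Sample rows inlined so the sketch runs without the Testdata table.
    Src(SomeID, String) AS
    (
        SELECT v.SomeID, v.String
        FROM (VALUES (1, 'S0000X-T859XX[DEFGH]'),
                     (2, 'Y0801X-Y0889X[AB]')) v(SomeID, String)
    )
SELECT s.SomeID,
       s.String,
       -- One row per character between the square brackets.
       Suffix = SUBSTRING(s.String, CHARINDEX('[', s.String) + t.N, 1)
FROM Src AS s
JOIN Tally AS t
  ON t.N <= CHARINDEX(']', s.String) - CHARINDEX('[', s.String) - 1
ORDER BY s.SomeID, t.N;
```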
This is some rather ugly string manipulation but since your data is not properly normalized you are kind of stuck there. Please try this query. It produces what you said you expect as output.
select *
, NewOutput = left(td.String, charindex('-', td.String) - 1) + SUBSTRING(td.String, CHARINDEX('[', td.String) + t.N, 1) +
left(substring(td.String, charindex('-', td.String), len(td.String)), charindex('[', substring(td.String, charindex('-', td.String), len(td.String))) - 1) + SUBSTRING(td.String, CHARINDEX('[', td.String) + t.N, 1)
from TestData td
join cteTally t on t.N <= CHARINDEX(']', td.String) - CHARINDEX('[', td.String) - 1
order by td.String
, t.N

I am posting this because I did it.
select distinct *
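-- append the character to both halves of the range to get e.g. S0000XD-T859XXD
,NewString = replace([base],'-',substring(splitter,number,1)+'-')+substring(splitter,number,1)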
from
(
select SomeID
-- split your column into a base plus a splitter column
,[base] = left(string,charindex('[',string)-1)
,splitter = substring(string, charindex('[',string)+1,len(string) - charindex('[',string)-1)
from
(
-- converted your insert into a union all
SELECT 1 SomeID, 'S0000X-T859XX[DEFGH]' string
union all
SELECT 1, 'T880XX-T889XX[DS]'
union all
SELECT 2, 'V0001X-Y048XX[DS]'
union all
SELECT 2, 'Y0801X-Y0889X[AB]'
) a
) inside
cross apply (Select number from master..spt_values where number>0 and number <=len(splitter)) b -- this is similar to a tally table using an internal SQL table

Since you didn't include any attempt to solve this, I will assume that you are not stuck on any particular technique, and are looking for a high-level approach.
I would solve this by first creating a Table-valued function that takes the id and the string as parameters, and creates an output table by looping through the characters in between the square-brackets, so that the function's output looks like this
ID String Character
1 S0000X-T859XX D
1 S0000X-T859XX E
1 S0000X-T859XX F
..
2 V0001X-Y048XX D
etc...
Then you simply write a query that joins the raw table to the function on the ID and the portion of the String without the brackets.

Matching string with LEVENSHTEIN algorithm

create table tbl1
(
name varchar(50)
);
insert into tbl1 values ('Mircrosoft SQL Server'),
('Office Microsoft');
create table tbl2
(
name varchar(50)
);
insert into tbl2 values ('SQL Server Microsoft'),
('Microsoft Office');
I want to get the match percentage between the name columns of the two tables.
I tried the Levenshtein algorithm. However, the given data contains the same words in both tables, only in a different order, so I want to see the output as a 100% match.
Tried: LEVENSHTEIN
SELECT [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) MatchedPercentage,a.name as tbl1_name,b.name as tbl2_name
FROM tbl1 a
CROSS JOIN tbl2 b
WHERE [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) >= 0;
Result:
MatchedPercentage tbl1_name tbl2_name
-----------------------------------------------------------------
5 Mircrosoft SQL Server SQL Server Microsoft
10 Office Microsoft SQL Server Microsoft
15 Mircrosoft SQL Server Microsoft Office
13 Office Microsoft Microsoft Office

As mentioned in the comments this can be achieved through the use of a string split table valued function. Personally I use one based on the very performant set-based tally table approach put together by Jeff Moden which is at the end of my answer.
Using this function allows you to compare the individual words as delimited by a space character and count up the number of matches compared to the total number of words in the two values.
Do note however that this solution falls over on any values with leading spaces. If this will be a problem, clean your data before running this script or adjust to handle them:
declare @t1 table(v nvarchar(50));
declare @t2 table(v nvarchar(50));
insert into @t1 values('Microsoft SQL Server'),('Office Microsoft'),('Other values'); -- Add in some extra values, with the same number of words and some with the same number of characters
insert into @t2 values('SQL Server Microsoft'),('Microsoft Office'),('that matched'),('that didn''t'),('Other valuee');
with c as
(
select t1.v as v1
,t2.v as v2
,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords -- String Length - String Length without spaces = Number of words - 1
from @t1 as t1
cross join @t2 as t2 -- Cross join the two tables to get all comparisons
where len(replace(t1.v,' ','')) = len(replace(t2.v,' ','')) -- Where the length without spaces is the same. Can't have the same words in a different order if the number of non space characters in the whole string is different
)
select c.v1
,c.v2
,c.NumWords
,sum(case when s1.item = s2.item then 1 else 0 end) as MatchedWords
from c
cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
,c.v2
,c.NumWords
having c.NumWords = sum(case when s1.item = s2.item then 1 else 0 end);
Output
+----------------------+----------------------+----------+--------------+
| v1 | v2 | NumWords | MatchedWords |
+----------------------+----------------------+----------+--------------+
| Microsoft SQL Server | SQL Server Microsoft | 3 | 3 |
| Office Microsoft | Microsoft Office | 2 | 2 |
+----------------------+----------------------+----------+--------------+
Function
create function dbo.fn_StringSplit4k
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
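For the caveat about leading spaces mentioned above, one possible cleanup step before splitting is sketched below. It trims both ends and collapses repeated spaces; the '<>' and '><' marker strings are an assumption and must not occur in the real data:

```sql
DECLARE @v nvarchar(50) = N'  Microsoft   SQL Server ';

-- 1. LTRIM/RTRIM remove leading and trailing spaces.
-- 2. Every remaining space becomes '<>', so a run of spaces becomes '<><><>'.
-- 3. Removing every '><' leaves exactly one '<>' per run.
-- 4. The surviving '<>' markers are turned back into single spaces.
SELECT Cleaned =
    REPLACE(
        REPLACE(
            REPLACE(LTRIM(RTRIM(@v)), N' ', N'<>'),
            N'><', N''),
        N'<>', N' ');
```

Applying the same expression to both columns before the cross join keeps the word counts and the split items consistent.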

Store multiple comma separated strings into temp table

Given strings:
string 1: 'A,B,C,D,E'
string 2: 'X091,X089,X051,X043,X023'
Want to store into the temp table as:
String1 String2
---------------------
A X091
B X089
C X051
D X043
E X023
Tried: Created user defined function named udf_split and inserting into table for each column.
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1) SELECT Item FROM udf_split(@Str1,',');
INSERT INTO #TestString(string2) SELECT Item FROM udf_split(@Str2,',');
But getting following result:
SELECT * FROM #TestString
string1 string2
-----------------
A NULL
B NULL
C NULL
D NULL
E NULL
NULL X091
NULL X089
NULL X051
NULL X043
NULL X023

You need to insert both parts of each row together. Otherwise you end up with each row having one column containing null - which is exactly what happened in your attempt.
First, I would recommend not messing around with comma delimited strings in the database at all, if that's possible.
If you have control over the input, you had better use table variables or XML.
If you don't, to split strings in any version before 2016 I would recommend first reading Aaron Bertrand's "Split strings the right way – or the next best way". In 2016 you should use the built-in STRING_SPLIT function.
For this kind of thing you want a splitting function that returns both the item and its position in the original string. Luckily for you, Jeff Moden has already written such a splitting function, and it is a very popular, high-performance one.
You can read all about it in "Tally OH! An Improved SQL 8K “CSV Splitter” Function".
So, here is Jeff's function:
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
and here is how you use it:
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1, string2)
SELECT A.Item, B.Item
FROM DelimitedSplit8K(@Str1,',') A
JOIN DelimitedSplit8K(@Str2,',') B
ON A.ItemNumber = B.ItemNumber;
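One caveat on the built-in function: in SQL Server 2016 through 2019, STRING_SPLIT returns only a value column, so it cannot pair items by position. SQL Server 2022 and Azure SQL Database add an optional third argument, enable_ordinal, that exposes the position. A minimal sketch, assuming one of those versions is available:

```sql
-- Requires SQL Server 2022 or Azure SQL Database: passing 1 as the third
-- argument (enable_ordinal) adds an "ordinal" column holding each item's
-- 1-based position in the input string.
DECLARE @Str1 varchar(max) = 'A,B,C,D,E';
DECLARE @Str2 varchar(max) = 'X091,X089,X051,X043,X023';

SELECT a.value AS string1,
       b.value AS string2
FROM STRING_SPLIT(@Str1, ',', 1) AS a
JOIN STRING_SPLIT(@Str2, ',', 1) AS b
  ON a.ordinal = b.ordinal
ORDER BY a.ordinal;
```

On older versions, a splitter that returns an item number, such as DelimitedSplit8K above, remains the safer choice for this kind of pairing.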

My suggestion uses two independent recursive splits. This allows us to transform the two strings into two sets with a position index. The final SELECT will join the two sets on their position index and return a sorted list:
DECLARE @str1 VARCHAR(1000)= 'A,B,C,D,E';
DECLARE @str2 VARCHAR(1000)= 'X091,X089,X051,X043,X023';
;WITH
--split the first string
a1 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str1, j+1) FROM a1 WHERE j > i)
,b1 AS (SELECT n, SUBSTRING(@str1, i+1, IIF(j>0, j, LEN(@str1)+1)-i-1) s FROM a1 WHERE i >= 0)
--split the second string
,a2 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str2, j+1) FROM a2 WHERE j > i)
,b2 AS (SELECT n, SUBSTRING(@str2, i+1, IIF(j>0, j, LEN(@str2)+1)-i-1) s FROM a2 WHERE i >= 0)
--join them by the index
SELECT b1.n
,b1.s AS s1
,b2.s AS s2
FROM b1
INNER JOIN b2 ON b1.n=b2.n
ORDER BY b1.n;
The result
n s1 s2
1 A X091
2 B X089
3 C X051
4 D X043
5 E X023
UPDATE: If you have v2016+...
With SQL-Server 2016+ you can use OPENJSON with a tiny string replacement to transform your CSV strings to a JSON-array:
SELECT a.[key]
,a.value AS s1
,b.value AS s2
FROM OPENJSON('["' + REPLACE(@str1,',','","') + '"]') a
INNER JOIN(SELECT * FROM OPENJSON('["' + REPLACE(@str2,',','","') + '"]')) b ON a.[key]=b.[key]
ORDER BY a.[key];
Unlike STRING_SPLIT(), this approach returns the key as the zero-based index of the element within the array. The result is the same...
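Two details are worth handling when the lists get longer or messier. OPENJSON returns [key] as a string, so ordering by it directly sorts '10' before '2'; and values containing quotes or backslashes would break the hand-built JSON. A hedged sketch that casts the key and escapes the input with STRING_ESCAPE (available from SQL Server 2016); the sample strings are made up for illustration:

```sql
-- Sample data with more than ten items, so that string ordering of the
-- key would differ from numeric ordering.
DECLARE @str1 varchar(1000) = 'A,B,C,D,E,F,G,H,I,J,K';
DECLARE @str2 varchar(1000) = 'X01,X02,X03,X04,X05,X06,X07,X08,X09,X10,X11';

SELECT CAST(a.[key] AS int) AS n,      -- zero-based position as a number
       a.value              AS s1,
       b.value              AS s2
FROM OPENJSON('["' + REPLACE(STRING_ESCAPE(@str1, 'json'), ',', '","') + '"]') AS a
JOIN OPENJSON('["' + REPLACE(STRING_ESCAPE(@str2, 'json'), ',', '","') + '"]') AS b
  ON a.[key] = b.[key]
ORDER BY CAST(a.[key] AS int);         -- numeric sort, not string sort
```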

Compare the two tables and update the value in a Flag column

I have two tables and the values like this
create table InputLocationTable(SKUID int,InputLocations varchar(100),Flag varchar(100))
create table Location(SKUID int,Locations varchar(100))
insert into InputLocationTable(SKUID,InputLocations) values(11,'Loc1, Loc2, Loc3, Loc4, Loc5, Loc6')
insert into InputLocationTable(SKUID,InputLocations) values(12,'Loc1, Loc2')
insert into InputLocationTable(SKUID,InputLocations) values(13,'Loc4,Loc5')
insert into Location(SKUID,Locations) values(11,'Loc3')
insert into Location(SKUID,Locations) values(11,'Loc4')
insert into Location(SKUID,Locations) values(11,'Loc5')
insert into Location(SKUID,Locations) values(11,'Loc7')
insert into Location(SKUID,Locations) values(12,'Loc10')
insert into Location(SKUID,Locations) values(12,'Loc1')
insert into Location(SKUID,Locations) values(12,'Loc5')
insert into Location(SKUID,Locations) values(13,'Loc4')
insert into Location(SKUID,Locations) values(13,'Loc2')
insert into Location(SKUID,Locations) values(13,'Loc2')
I need to get the output by matching the SKUIDs from each table and update the value in the Flag column as shown in the screenshot. I have tried something like this code:
SELECT STUFF((select ','+ Data.C1
FROM
(select
n.r.value('.', 'varchar(50)') AS C1
from InputLocation as T
cross apply (select cast('<r>'+replace(replace(Location,'&','&amp;'), ',', '</r><r>')+'</r>' as xml)) as S(XMLCol)
cross apply S.XMLCol.nodes('r') as n(r)) DATA
WHERE data.C1 NOT IN (SELECT Location
FROM Location) for xml path('')),1,1,'') As Output
But I am not convinced by the output, and I am also trying to avoid the XML PATH code because its performance is not great. I need the output to look like the screenshot below. Any help would be greatly appreciated.

I think you need to first look at why you think the XML approach is not performing well enough for your needs, as it has actually been shown to perform very well for larger input strings.
If you only need to handle input strings of up to either 4000 or 8000 characters (non max nvarchar and varchar types respectively), you can utilise a tally table contained within an inline table valued function which will also perform very well. The version I use can be found at the end of this post.
Utilising this function we can split out the values in your InputLocations column, though we still need to use for xml to concatenate them back together for your desired format:
-- Define data
declare @InputLocationTable table (SKUID int,InputLocations varchar(100),Flag varchar(100));
declare @Location table (SKUID int,Locations varchar(100));
insert into @InputLocationTable(SKUID,InputLocations) values (11,'Loc1, Loc2, Loc3, Loc4, Loc5, Loc6'),(12,'Loc1, Loc2'),(13,'Loc4,Loc5'),(14,'Loc1');
insert into @Location(SKUID,Locations) values (11,'Loc3'),(11,'Loc4'),(11,'Loc5'),(11,'Loc7'),(12,'Loc10'),(12,'Loc1'),(12,'Loc5'),(13,'Loc4'),(13,'Loc2'),(13,'Loc2'),(14,'Loc1');
--Query
-- Derived table splits out the values held within the InputLocations column
with i as
(
select i.SKUID
,i.InputLocations
,s.item as Loc
from @InputLocationTable as i
cross apply dbo.fn_StringSplit4k(replace(i.InputLocations,' ',''),',',null) as s
)
select il.SKUID
,il.InputLocations
,isnull('Add ' -- The split Locations are then matched to those already in @Location and those not present are concatenated together.
+ stuff((select ', ' + i.Loc
from i
left join @Location as l
on i.SKUID = l.SKUID
and i.Loc = l.Locations
where il.SKUID = i.SKUID
and l.SKUID is null
for xml path('')
)
,1,2,''
)
,'No Flag') as Flag
from @InputLocationTable as il
order by il.SKUID;
Output:
+-------+------------------------------------+----------------------+
| SKUID | InputLocations | Flag |
+-------+------------------------------------+----------------------+
| 11 | Loc1, Loc2, Loc3, Loc4, Loc5, Loc6 | Add Loc1, Loc2, Loc6 |
| 12 | Loc1, Loc2 | Add Loc2 |
| 13 | Loc4,Loc5 | Add Loc5 |
| 14 | Loc1 | No Flag |
+-------+------------------------------------+----------------------+
For nvarchar input (I have different functions for varchar and max type input) this is my version of the string splitting function linked above:
create function [dbo].[fn_StringSplit4k]
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value in the list to return. NULL returns all.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
go
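If the goal is to avoid FOR XML PATH altogether and the server is recent enough, the built-in STRING_SPLIT (SQL Server 2016+) and STRING_AGG (SQL Server 2017+) functions can do the split and the re-concatenation. This is only a sketch under that version assumption, with a reduced copy of the sample data:

```sql
declare @InputLocationTable table (SKUID int, InputLocations varchar(100));
declare @Location table (SKUID int, Locations varchar(100));
insert into @InputLocationTable values (11,'Loc1, Loc2, Loc3'),(12,'Loc1');
insert into @Location values (11,'Loc3'),(12,'Loc1');

select il.SKUID
      ,il.InputLocations
       -- 'Add ' + NULL is NULL, so rows with nothing missing fall back to 'No Flag'.
      ,Flag = isnull('Add ' + m.Missing, 'No Flag')
from @InputLocationTable as il
outer apply (-- Concatenate the input locations that have no match in @Location.
             select Missing = string_agg(ltrim(s.value), ', ')
             from string_split(il.InputLocations, ',') as s
             where not exists (select 1
                               from @Location as l
                               where l.SKUID = il.SKUID
                                 and l.Locations = ltrim(s.value))
            ) as m
order by il.SKUID;
```

The order of the concatenated values is not guaranteed here, because STRING_SPLIT without an ordinal does not expose positions; if the order matters, the tally-based splitter above with its rn column is the safer basis.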

How to compare if two strings contain the same words in SQL Server 2014?

I'm trying to solve the same problem as in this question, but this time in SQL Server 2014. I need to check if strings are made out of the same words:
Returns true for:
Antoine de Saint-Exupéry = de Saint-Exupéry Antoine = Saint-Exupéry Antoine de = etc.
and
Returns false for:
Antoine de Saint-Exupéry != Antoine de Saint != Antoine Antoine de Saint-Exupéry != etc.
What are my options in SQL Server 2014? Is there a built-in function for such comparison?

To compare 2 strings, one could use (or abuse) the sorting capability in XQuery.
Cast the string to an XML, sort the elements and then return a string without the tags.
For example:
DECLARE @Words1 NVARCHAR(MAX) = N'Antoine de Saint-Exupéry';
DECLARE @Words2 NVARCHAR(MAX) = N'Saint-Exupéry Antoine de';
DECLARE @SortedWords1 NVARCHAR(MAX) = cast('<x>'+replace(@Words1,' ','</x><x>')+'</x>' as XML).query('for $x in /x order by $x ascending return $x').value('.','nvarchar(max)');
DECLARE @SortedWords2 NVARCHAR(MAX) = cast('<x>'+replace(@Words2,' ','</x><x>')+'</x>' as XML).query('for $x in /x order by $x ascending return $x').value('.','nvarchar(max)');
DECLARE @SameWords BIT = (case
when @SortedWords1 = @SortedWords2
then 1
else 0
end);
SELECT @SameWords as SameWords;
Returns:
SameWords
---------
True

Here is one way you could roll your own for this. I am using the string splitter from Jeff Moden. You can find the original article here. http://www.sqlservercentral.com/articles/Tally+Table/72993/. If you don't like that splitter there are some other great versions here. https://sqlperformance.com/2012/07/t-sql-queries/split-strings. I like the one from Jeff Moden because unlike any of the other splitters you get the ItemNumber returned which in some cases is incredibly useful.
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
The basic concept here is that you have to split your strings into words and then do a comparison. I used a couple of CTEs so the process is more obvious.
declare @Phrase1 nvarchar(100) = 'Antoine de Saint-Exupéry'
, @Phrase2 nvarchar(100) = 'de Saint-Exupéry Antoine'
;
with Phrase1 as
(
select *
from DelimitedSplit8K(@Phrase1, ' ')
)
, Phrase2 as
(
select *
from DelimitedSplit8K(@Phrase2, ' ')
)
)
-- Any row surviving the WHERE clause is a word without a partner, so the phrases differ.
select PhrasesEqual = convert(bit, case when count(*) > 0 then 0 else 1 end)
from Phrase1 p1
full outer join Phrase2 p2 on p2.Item = p1.Item
where p1.Item is null
or p2.Item is null
;
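One case the join on Item alone does not distinguish is a repeated word, such as "Antoine Antoine de Saint-Exupéry" against "Antoine de Saint-Exupéry" from the question. A possible variation, sketched here with unaccented sample text and assuming dbo.DelimitedSplit8K exists as defined above, compares the words together with how often each one occurs:

```sql
declare @Phrase1 varchar(100) = 'Antoine Antoine de Saint-Exupery'
      , @Phrase2 varchar(100) = 'Antoine de Saint-Exupery';

with Words1 as
(
    -- One row per distinct word, with the number of times it occurs.
    select Item, Cnt = count(*)
    from dbo.DelimitedSplit8K(@Phrase1, ' ')
    group by Item
)
, Words2 as
(
    select Item, Cnt = count(*)
    from dbo.DelimitedSplit8K(@Phrase2, ' ')
    group by Item
)
-- A word only finds a partner when both the word and its count match;
-- any unmatched row on either side means the phrases differ.
select PhrasesEqual = convert(bit, case when count(*) = 0 then 1 else 0 end)
from Words1 w1
full outer join Words2 w2
  on w2.Item = w1.Item
 and w2.Cnt  = w1.Cnt
where w1.Item is null
   or w2.Item is null;
```

With the sample values above, "Antoine" occurs twice in one phrase and once in the other, so the comparison reports the phrases as different.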
