Store multiple comma separated strings into temp table - sql-server

Given strings:
string 1: 'A,B,C,D,E'
string 2: 'X091,X089,X051,X043,X023'
I want to store them in a temp table as:
String1 String2
---------------------
A X091
B X089
C X051
D X043
E X023
What I tried: I created a user-defined function named udf_split and inserted into the table once per column.
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1) SELECT Item FROM udf_split(@Str1,',');
INSERT INTO #TestString(string2) SELECT Item FROM udf_split(@Str2,',');
But I'm getting the following result:
SELECT * FROM #TestString
string1 string2
-----------------
A NULL
B NULL
C NULL
D NULL
E NULL
NULL X091
NULL X089
NULL X051
NULL X043
NULL X023

You need to insert both parts of each row together. Otherwise you end up with each row having one column containing null - which is exactly what happened in your attempt.
First, I would recommend not messing around with comma-delimited strings in the database at all, if that's possible.
If you have control over the input, you'd be better off using table-valued parameters or XML.
If you don't, then to split strings on any version before 2016 I would recommend first reading Aaron Bertrand's Split strings the right way – or the next best way. On 2016+ you should use the built-in STRING_SPLIT function.
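For completeness (my addition, not part of the original answer): on SQL Server 2022+ and Azure SQL, STRING_SPLIT accepts a third enable_ordinal argument, which makes this exact problem trivial. A sketch, assuming the question's variables:

```sql
-- SQL Server 2022+ / Azure SQL only: the third argument adds an ordinal column
SELECT a.value AS String1, b.value AS String2
FROM STRING_SPLIT(@Str1, ',', 1) a
JOIN STRING_SPLIT(@Str2, ',', 1) b
    ON a.ordinal = b.ordinal;
```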
For this kind of thing you want to use a splitting function that returns both the item and its location in the original string. Lucky for you, Jeff Moden has already written such a function, and it's a very popular, high-performance one.
You can read all about it on Tally OH! An Improved SQL 8K “CSV Splitter” Function.
So, here is Jeff's function:
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
and here is how you use it:
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1, string2)
SELECT A.Item, B.Item
FROM DelimitedSplit8K(@Str1,',') A
JOIN DelimitedSplit8K(@Str2,',') B
ON A.ItemNumber = B.ItemNumber;

My suggestion uses two independent recursive splits. This lets us transform the two strings into two sets with a position index. The final SELECT joins the two sets on that index and returns a sorted list:
DECLARE @str1 VARCHAR(1000)= 'A,B,C,D,E';
DECLARE @str2 VARCHAR(1000)= 'X091,X089,X051,X043,X023';
;WITH
--split the first string
a1 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str1, j+1) FROM a1 WHERE j > i)
,b1 AS (SELECT n, SUBSTRING(@str1, i+1, IIF(j>0, j, LEN(@str1)+1)-i-1) s FROM a1 WHERE i >= 0)
--split the second string
,a2 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str2, j+1) FROM a2 WHERE j > i)
,b2 AS (SELECT n, SUBSTRING(@str2, i+1, IIF(j>0, j, LEN(@str2)+1)-i-1) s FROM a2 WHERE i >= 0)
--join them by the index
SELECT b1.n
,b1.s AS s1
,b2.s AS s2
FROM b1
INNER JOIN b2 ON b1.n=b2.n
ORDER BY b1.n;
The result
n s1 s2
1 A X091
2 B X089
3 C X051
4 D X043
5 E X023
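One caveat to add here (my note, not part of the original answer): recursive CTEs are capped at 100 recursion levels by default, so lists with more than roughly 100 elements would make this query fail. The cap can be lifted with a query hint on the final statement:

```sql
SELECT b1.n, b1.s AS s1, b2.s AS s2
FROM b1
INNER JOIN b2 ON b1.n = b2.n
ORDER BY b1.n
OPTION (MAXRECURSION 0); -- 0 removes the default limit of 100 recursions
```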
UPDATE: If you have v2016+...
With SQL Server 2016+ you can use OPENJSON with a tiny string replacement to transform your CSV strings into a JSON array:
SELECT a.[key]
,a.value AS s1
,b.value AS s2
FROM OPENJSON('["' + REPLACE(@str1,',','","') + '"]') a
INNER JOIN(SELECT * FROM OPENJSON('["' + REPLACE(@str2,',','","') + '"]')) b ON a.[key]=b.[key]
ORDER BY a.[key];
Unlike STRING_SPLIT(), this approach returns the key as the zero-based index of the element within the array. The result is the same...
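A small caveat (my addition): with the default schema, OPENJSON returns [key] as nvarchar(4000), so once an array grows past ten elements the ORDER BY above would sort '10' before '2'. Casting keeps the ordering numeric:

```sql
SELECT a.[key], a.value AS s1, b.value AS s2
FROM OPENJSON('["' + REPLACE(@str1, ',', '","') + '"]') a
INNER JOIN OPENJSON('["' + REPLACE(@str2, ',', '","') + '"]') b
    ON a.[key] = b.[key]
ORDER BY CAST(a.[key] AS int); -- [key] is nvarchar(4000) in the default schema
```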


SQL Server string to m x n table

There are scores of excellent posts here related to splitting a string in SQL.
I am looking however for a true string to table function.
In other words, taking a string that represents a table (with multiple rows and columns) and generating the equivalent table in SQL Server.
This means that there is an inner delimiter, and an outer delimiter.
The string needs to be split by the outer delimiter into rows, and then each row turned into a set of columns.
I have not been able to find anything posted here, so I thought I'd submit my version and solicit feedback.
In my particular usage scenarios, I know that there are no more than 10 columns per row, and that the total size is under varchar(8000).
Here's the tfSplitString function (listed just for completeness; you can use your own version as you like) followed by the dbo.tfStringToTable10Cols. Since I don't know the column names, they are simply output as Col1, Col2, Col3, etc.
if object_id('dbo.tfSplitString') is not null
exec (' DROP FUNCTION [dbo].[tfSplitString]');
go
CREATE FUNCTION [dbo].[tfSplitString]
/**********************************************************************************************************************
Purpose:
Split a given string at a given delimiter and return a list of the split elements (items).
Notes:
1. Leading and trailing delimiters are treated as if an empty string element were present.
2. Consecutive delimiters are treated as if an empty string element were present between them.
3. Except when spaces are used as a delimiter, all spaces present in each element are preserved.
select * from [dbo].[tfSplitString]('This is a test of the emergency broadcast system', ' ')
Returns:
1 This
2 is
3 a
4 test
5 of
6 the
7 emergency
8 broadcast
9 system
iTVF containing the following:
ItemNumber = Element position of Item as a BIGINT (not converted to INT to eliminate a CAST)
Item = Element value as a VARCHAR(8000)
CROSS APPLY Usage Examples and Tests:
--=====================================================================================================================
-- TEST 1:
-- This tests for various possible conditions in a string using a comma as the delimiter. The expected results are
-- laid out in the comments
--=====================================================================================================================
--===== Conditionally drop the test tables to make reruns easier for testing.
-- (this is NOT a part of the solution)
IF OBJECT_ID('tempdb..#JBMTest') IS NOT NULL DROP TABLE #JBMTest
;
--===== Create and populate a test table on the fly (this is NOT a part of the solution).
-- In the following comments, "b" is a blank and "E" is an element in the left to right order.
-- Double Quotes are used to encapsulate the output of "Item" so that you can see that all blanks
-- are preserved no matter where they may appear.
SELECT *
INTO #JBMTest
FROM ( --# & type of Return Row(s)
SELECT 0, NULL UNION ALL --1 NULL
SELECT 1, SPACE(0) UNION ALL --1 b (Empty String)
SELECT 2, SPACE(1) UNION ALL --1 b (1 space)
SELECT 3, SPACE(5) UNION ALL --1 b (5 spaces)
SELECT 4, ',' UNION ALL --2 b b (both are empty strings)
SELECT 5, '55555' UNION ALL --1 E
SELECT 6, ',55555' UNION ALL --2 b E
SELECT 7, ',55555,' UNION ALL --3 b E b
SELECT 8, '55555,' UNION ALL --2 E b
SELECT 9, '55555,1' UNION ALL --2 E E
SELECT 10, '1,55555' UNION ALL --2 E E
SELECT 11, '55555,4444,333,22,1' UNION ALL --5 E E E E E
SELECT 12, '55555,4444,,333,22,1' UNION ALL --6 E E b E E E
SELECT 13, ',55555,4444,,333,22,1,' UNION ALL --8 b E E b E E E b
SELECT 14, ',55555,4444,,,333,22,1,' UNION ALL --9 b E E b b E E E b
SELECT 15, ' 4444,55555 ' UNION ALL --2 E (w/Leading Space) E (w/Trailing Space)
SELECT 16, 'This,is,a,test.' --E E E E
) d (SomeID, SomeValue)
;
--===== Split the CSV column for the whole table using CROSS APPLY (this is the solution)
SELECT test.SomeID, test.SomeValue, split.ItemNumber, Item = QUOTENAME(split.Item,'"')
FROM #JBMTest test
CROSS APPLY dbo.tfSplitString(test.SomeValue,',') split
;
--=====================================================================================================================
-- TEST 2:
-- This tests for various "alpha" splits and COLLATION using all ASCII characters from 0 to 255 as a delimiter against
-- a given string. Note that not all of the delimiters will be visible and some will show up as tiny squares because
-- they are "control" characters. More specifically, this test will show you what happens to various non-accented
-- letters for your given collation depending on the delimiter you chose.
--=====================================================================================================================
WITH
cteBuildAllCharacters (String,Delimiter) AS
(
SELECT TOP 256
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
CHAR(ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1)
FROM master.sys.all_columns
)
SELECT ASCII_Value = ASCII(c.Delimiter), c.Delimiter, split.ItemNumber, Item = QUOTENAME(split.Item,'"')
FROM cteBuildAllCharacters c
CROSS APPLY dbo.tfSplitString(c.String,c.Delimiter) split
ORDER BY ASCII_Value, split.ItemNumber
;
--===== Define I/O parameters
*/
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
with
cte1(prevPos, nextPos) as ( -- previous position of the delimiter; next position of the delimiter
select 0, CHARINDEX(@pDelimiter, @pString)
union all
select nextPos, CHARINDEX(@pDelimiter, @pString, nextPos+1)
from cte1
where nextPos>0
)
select row_number() over (order by (select null)) ItemNumber
, SUBSTRING(@pString, prevPos+1, case when nextPos=0 then len(@pString) else nextPos-(prevPos+1) end) Item
from cte1;
GO
if object_id('dbo.tfStringToTable10Cols') is not null exec (' DROP FUNCTION [dbo].[tfStringToTable10Cols]');
go
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
/**
Inlined Table-valued function to deserialize a string into an n by 10 table.
Column values must not contain either the inner or outer delimiter.
Empty rows are ignored.
Note: This could trivially be converted to return more columns if a use case arises
Example:
select * from dbo.tfStringToTable10Cols('a1,b1,c1,d1,e1,f1,g1,h1,i1,j1;;,b2,c2;a3,,c3,d3,e3,f3,g3,h3,i3,j3', DEFAULT, DEFAULT)
should return
Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8  Col9  Col10
a1    b1    c1    d1    e1    f1    g1    h1    i1    j1
      b2    c2    NULL  NULL  NULL  NULL  NULL  NULL  NULL
a3          c3    d3    e3    f3    g3    h3    i3    j3
*/
create function dbo.tfStringToTable10Cols
(
@Series varchar(8000),
@OuterDelimiter char(1) = ';',
@InnerDelimiter char(1) = ','
)
RETURNS TABLE
AS
RETURN
select p.[1] Col1, p.[2] Col2, p.[3] Col3, p.[4] Col4, p.[5] Col5, p.[6] Col6, p.[7] Col7, p.[8] Col8, p.[9] Col9, p.[10] Col10
from dbo.tfSplitString(@Series, @OuterDelimiter) as r -- This breaks up the strings into serialized rowstrings
cross apply (
select * from dbo.tfSplitString(r.Item, @InnerDelimiter) cols -- This breaks up the rowstrings into columnNumber and columnVal rows
pivot (Max(cols.Item) -- The pivot then turns these rows into columns. The Max() is a no-op since each columnNumber is unique
for cols.ItemNumber in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10]))
as tmp) as P
where len(r.Item)>0
go
select * from dbo.tfStringToTable10Cols('a1,b1,c1,d1,e1,f1,g1,h1,i1,j1;,b2,c2;a3,,c3,d3,e3,f3,g3,h3,i3,j3', DEFAULT, DEFAULT)
Just for fun.
If by chance you are on 2016+, you could take advantage of the JSON capabilities, and step away from your tfSplitString function
Example
Declare @Series varchar(max) = 'a1,b1,c1,d1,e1,f1,g1,h1,i1,j1;;,b2,c2;a3,,c3,d3,e3,f3,g3,h3,i3,j3'
Declare @OuterDelimiter varchar(25) = ';'
Declare @InnerDelimiter varchar(25) = ','
Select B.*
From (
Select RetSeq = [Key]+1
,RetVal = Value
From OpenJSON( '["'+replace(replace(@Series,'"','\"'),@OuterDelimiter,'","')+'"]' )
) A
Cross Apply (
Select Pos1 = JSON_VALUE(S,'$[0]')
,Pos2 = JSON_VALUE(S,'$[1]')
,Pos3 = JSON_VALUE(S,'$[2]')
,Pos4 = JSON_VALUE(S,'$[3]')
,Pos5 = JSON_VALUE(S,'$[4]')
,Pos6 = JSON_VALUE(S,'$[5]')
,Pos7 = JSON_VALUE(S,'$[6]')
,Pos8 = JSON_VALUE(S,'$[7]')
,Pos9 = JSON_VALUE(S,'$[8]')
,Pos10 = JSON_VALUE(S,'$[9]')
From ( values ( '["'+replace(replace(A.RetVal,'"','\"'),@InnerDelimiter,'","')+'"]' ) ) A(S)
) B
Where A.RetVal <>''
Returns
Pos1  Pos2  Pos3  Pos4  Pos5  Pos6  Pos7  Pos8  Pos9  Pos10
a1    b1    c1    d1    e1    f1    g1    h1    i1    j1
      b2    c2    NULL  NULL  NULL  NULL  NULL  NULL  NULL
a3          c3    d3    e3    f3    g3    h3    i3    j3

How to generate individual rows for each character between two delimiters in a string

I have a data set with square brackets.
CREATE TABLE Testdata
(
SomeID INT,
String VARCHAR(MAX)
)
INSERT Testdata SELECT 1, 'S0000X-T859XX[DEFGH]'
INSERT Testdata SELECT 1, 'T880XX-T889XX[DS]'
INSERT Testdata SELECT 2, 'V0001X-Y048XX[DS]'
INSERT Testdata SELECT 2, 'Y0801X-Y0889X[AB]'
I need to get output like below:
SomeId String
1 S0000XD-T859XXD
1 S0000XE-T859XXE
1 S0000XF-T859XXF
1 S0000XG-T859XXG
1 S0000XH-T859XXH
1 T880XXD-T889XXD
1 T880XXS-T889XXS
2 V0001XD-Y048XXD
2 V0001XS-Y048XXS
2 Y0801XA-Y0889XA
2 Y0801XB-Y0889XB
I'd appreciate it if anyone can help with this.
You don't need a function here, and certainly no need for loops. A tally table will make short work of this. I keep one as a view on my system; it is nasty fast!!!
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
GO
You don't have to use a view like this; you could just use the CTEs directly in your code.
This is some rather ugly string manipulation, but since your data is not properly normalized you are kind of stuck there. Please try this query; it produces what you said you expect as output.
select *
, NewOutput = left(td.String, charindex('-', td.String) - 1) + SUBSTRING(td.String, CHARINDEX('[', td.String) + t.N, 1) +
left(substring(td.String, charindex('-', td.String), len(td.String)), charindex('[', substring(td.String, charindex('-', td.String), len(td.String))) - 1) + SUBSTRING(td.String, CHARINDEX('[', td.String) + t.N, 1)
from TestData td
join cteTally t on t.N <= CHARINDEX(']', td.String) - CHARINDEX('[', td.String) - 1
order by td.String
, t.N
I am posting this because I did it.
select distinct *
,[base]+substring(splitter,number,1)
from
(
select SomeID
-- split your column into a base plus a splitter column
,[base] = left(string,charindex('[',string)-1)
,splitter = substring(string, charindex('[',string)+1,len(string) - charindex('[',string)-1)
from
(
-- converted your insert into a union all
SELECT 1 SomeID, 'S0000X-T859XX[DEFGH]' string
union all
SELECT 1, 'T880XX-T889XX[DS]'
union all
SELECT 2, 'V0001X-Y048XX[DS]'
union all
SELECT 2, 'Y0801X-Y0889X[AB]'
) a
) inside
cross apply (Select number from master..spt_values where number>0 and number <=len(splitter)) b -- this is similar to a tally table using an internal SQL table
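A note on that last line (my addition, not part of the original answer): master..spt_values holds several row types that reuse the same number values, which is what forces the select distinct at the top. Filtering on the number-projection type makes the distinct unnecessary:

```sql
cross apply (Select number from master..spt_values
             where type = 'P' -- the contiguous 0..2047 number projection
               and number > 0 and number <= len(splitter)) b
```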
Since you didn't include any attempt to solve this, I will assume that you are not stuck on any particular technique, and are looking for a high-level approach.
I would solve this by first creating a Table-valued function that takes the id and the string as parameters, and creates an output table by looping through the characters in between the square-brackets, so that the function's output looks like this
ID String Character
1 S0000X-T859XX D
1 S0000X-T859XX E
1 S0000X-T859XX F
..
2 V0001X-Y048XX D
etc...
Then you simply write a query that joins the raw table to the function on the ID and the portion of the String without the brackets.

Matching string with LEVENSHTEIN algorithm

create table tbl1
(
name varchar(50)
);
insert into tbl1 values ('Mircrosoft SQL Server'),
('Office Microsoft');
create table tbl2
(
name varchar(50)
);
insert into tbl2 values ('SQL Server Microsoft'),
('Microsoft Office');
I want to get the percentage match between the name columns of the two tables.
I tried the LEVENSHTEIN algorithm, but in the given data the strings are the same between the tables, just in a different word order, so I want to see the output as a 100% match.
Tried: LEVENSHTEIN
SELECT [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) MatchedPercentage,a.name as tbl1_name,b.name as tbl2_name
FROM tbl1 a
CROSS JOIN tbl2 b
WHERE [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) >= 0;
Result:
MatchedPercentage tbl1_name tbl2_name
-----------------------------------------------------------------
5 Mircrosoft SQL Server SQL Server Microsoft
10 Office Microsoft SQL Server Microsoft
15 Mircrosoft SQL Server Microsoft Office
13 Office Microsoft Microsoft Office
As mentioned in the comments this can be achieved through the use of a string split table valued function. Personally I use one based on the very performant set-based tally table approach put together by Jeff Moden which is at the end of my answer.
Using this function allows you to compare the individual words as delimited by a space character and count up the number of matches compared to the total number of words in the two values.
Do note however that this solution falls over on any values with leading spaces. If this will be a problem, clean your data before running this script or adjust to handle them:
declare @t1 table(v nvarchar(50));
declare @t2 table(v nvarchar(50));
insert into @t1 values('Microsoft SQL Server'),('Office Microsoft'),('Other values'); -- Add in some extra values, with the same number of words and some with the same number of characters
insert into @t2 values('SQL Server Microsoft'),('Microsoft Office'),('that matched'),('that didn''t'),('Other valuee');
with c as
(
select t1.v as v1
,t2.v as v2
,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords -- String Length - String Length without spaces = Number of words - 1
from @t1 as t1
cross join @t2 as t2 -- Cross join the two tables to get all comparisons
where len(replace(t1.v,' ','')) = len(replace(t2.v,' ','')) -- Where the length without spaces is the same. Can't have the same words in a different order if the number of non space characters in the whole string is different
)
select c.v1
,c.v2
,c.NumWords
,sum(case when s1.item = s2.item then 1 else 0 end) as MatchedWords
from c
cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
,c.v2
,c.NumWords
having c.NumWords = sum(case when s1.item = s2.item then 1 else 0 end);
Output
+----------------------+----------------------+----------+--------------+
| v1 | v2 | NumWords | MatchedWords |
+----------------------+----------------------+----------+--------------+
| Microsoft SQL Server | SQL Server Microsoft | 3 | 3 |
| Office Microsoft | Microsoft Office | 2 | 2 |
+----------------------+----------------------+----------+--------------+
Function
create function dbo.fn_StringSplit4k
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;

How to compare if two strings contain the same words in SQL Server 2014?

I'm trying to solve the same problem as in this question, but this time in SQL Server 2014. I need to check if strings are made out of the same words:
Returns true for:
Antoine de Saint-Exupéry = de Saint-Exupéry Antoine = Saint-Exupéry Antoine de = etc.
and
Returns false for:
Antoine de Saint-Exupéry != Antoine de Saint != Antoine Antoine de Saint-Exupéry != etc.
What are my options in SQL Server 2014? Is there a built-in function for such comparison?
To compare two strings, one could (ab)use the sorting capability in XQuery.
Cast the string to an XML, sort the elements and then return a string without the tags.
For example:
DECLARE @Words1 NVARCHAR(MAX) = N'Antoine de Saint-Exupéry';
DECLARE @Words2 NVARCHAR(MAX) = N'Saint-Exupéry Antoine de';
DECLARE @SortedWords1 NVARCHAR(MAX) = cast('<x>'+replace(@Words1,' ','</x><x>')+'</x>' as XML).query('for $x in /x order by $x ascending return $x').value('.','nvarchar(max)');
DECLARE @SortedWords2 NVARCHAR(MAX) = cast('<x>'+replace(@Words2,' ','</x><x>')+'</x>' as XML).query('for $x in /x order by $x ascending return $x').value('.','nvarchar(max)');
DECLARE @SameWords BIT = (case
when @SortedWords1 = @SortedWords2
then 1
else 0
end);
SELECT @SameWords as SameWords;
Returns:
SameWords
---------
True
Here is one way you could roll your own for this. I am using the string splitter from Jeff Moden. You can find the original article here. http://www.sqlservercentral.com/articles/Tally+Table/72993/. If you don't like that splitter there are some other great versions here. https://sqlperformance.com/2012/07/t-sql-queries/split-strings. I like the one from Jeff Moden because unlike any of the other splitters you get the ItemNumber returned which in some cases is incredibly useful.
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
The basic concept here is that you have to split your strings into words and then do a comparison. I used a couple of CTEs to make the process more obvious. The following works for all of the examples you posted.
declare @Phrase1 nvarchar(100) = 'Antoine de Saint-Exupéry'
, @Phrase2 nvarchar(100) = 'de Saint-Exupéry Antoine'
;
with Phrase1 as
(
select *
from DelimitedSplit8K(@Phrase1, ' ')
)
, Phrase2 as
(
select *
from DelimitedSplit8K(@Phrase2, ' ')
)
select PhrasesEqual = convert(bit, case when count(*) = 0 then 1 else 0 end) -- no unmatched words on either side
from Phrase1 p1
full outer join Phrase2 p2 on p2.Item = p1.Item
where p1.Item is null
or p2.Item is null
;
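One caveat (my addition, not part of the original answer): because the join matches on word value alone, phrases with repeated words, such as 'Antoine Antoine de Saint-Exupéry' versus 'Antoine de Saint-Exupéry', would compare as equal even though the question expects them not to. A sketch of a duplicate-safe variant that compares per-word counts, assuming the same @Phrase1/@Phrase2 variables:

```sql
with Phrase1 as (select Item, cnt = count(*) from DelimitedSplit8K(@Phrase1, ' ') group by Item)
   , Phrase2 as (select Item, cnt = count(*) from DelimitedSplit8K(@Phrase2, ' ') group by Item)
select PhrasesEqual = convert(bit, case when count(*) = 0 then 1 else 0 end)
from Phrase1 p1
full outer join Phrase2 p2 on p2.Item = p1.Item and p2.cnt = p1.cnt
where p1.Item is null
   or p2.Item is null;
```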

complex SQL string parsing

I have the following text field in SQL Server table:
1!1,3!0,23!0,288!0,340!0,521!0,24!0,38!0,26!0,27!0,281!0,19!0,470!0,568!0,601!0,2!1,251!0,7!2,140!0,285!0,11!2,33!0
Would like to retrieve only the part before the exclamation mark (!). So for 1!1 I only want 1, for 3!0 I only want 3, for 23!0 I only want 23.
Would also like to retrieve only the part after the exclamation mark (!). So for 1!1 I only want 1, for 3!0 I only want 0, for 23!0 I only want 0.
Both point 1 and point 2 should be inserted into separate columns of a SQL Server table.
I LOVE SQL Server's XML capabilities. It is a great way to parse data. Try this one out:
--Load the original string
DECLARE @string nvarchar(max) = '1!2,3!4,5!6,7!8,9!10';
--Turn it into XML
SET @string = REPLACE(@string,',','</SecondNumber></Pair><Pair><FirstNumber>') + '</SecondNumber></Pair>';
SET @string = '<Pair><FirstNumber>' + REPLACE(@string,'!','</FirstNumber><SecondNumber>');
--Show the new version of the string
SELECT @string AS XmlIfiedString;
--Load it into an XML variable
DECLARE @xml XML = @string;
--Now, First and Second Number from each pair...
SELECT
Pairs.Pair.value('FirstNumber[1]','nvarchar(1024)') AS FirstNumber,
Pairs.Pair.value('SecondNumber[1]','nvarchar(1024)') AS SecondNumber
FROM @xml.nodes('//*:Pair') Pairs(Pair);
The above query turned the string into XML like this:
<Pair><FirstNumber>1</FirstNumber><SecondNumber>2</SecondNumber></Pair> ...
Then parsed it to return a result like:
FirstNumber | SecondNumber
----------- | ------------
1 | 2
3 | 4
5 | 6
7 | 8
9 | 10
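One caveat (my addition): since the string is concatenated into raw XML, any value containing &, < or > would break the CAST to XML. That is no concern for numeric pairs like these, but general data would need escaping first, e.g.:

```sql
-- Escape XML-special characters before the tag substitution (& must come first)
SET @string = REPLACE(REPLACE(REPLACE(@string, '&', '&amp;'), '<', '&lt;'), '>', '&gt;');
```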
I completely agree with the guys complaining about this sort of data.
The fact however, is that we often don't have any control of the format of our sources.
Here's my approach...
First you need a tokeniser. This one is very efficient (probably the fastest non-CLR). Found at http://www.sqlservercentral.com/articles/Tally+Table/72993/
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
GO
Then you consume it like so...
DECLARE @Wtf VARCHAR(1000) = '1!1,3!0,23!0,288!0,340!0,521!0,24!0,38!0,26!0,27!0,281!0,19!0,470!0,568!0,601!0,2!1,251!0,7!2,140!0,285!0,11!2,33!0'
SELECT LEFT(Item, CHARINDEX('!', Item)-1)
,RIGHT(Item, CHARINDEX('!', REVERSE(Item))-1)
FROM [dbo].[DelimitedSplit8K](@Wtf, ',')
The function posted and the parsing logic can, of course, be integrated into a single function.
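As a side note (my addition, not part of the original answer): because each item here has exactly two parts, PARSENAME can also pull them apart once the ! is swapped for a dot, a trick that only works for up to four dot-free parts. Assuming the @Wtf variable from the previous snippet:

```sql
SELECT PARSENAME(REPLACE(Item, '!', '.'), 2) AS Part1 -- part before the !
     , PARSENAME(REPLACE(Item, '!', '.'), 1) AS Part2 -- part after the !
FROM [dbo].[DelimitedSplit8K](@Wtf, ',');
```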
I agree that normalizing the data is the best way. However, here is the XML solution to parse the data:
DECLARE @str VARCHAR(1000) = '1!1,3!0,23!0,288!0,340!0,521!0,24!0,38!0,26!0,27!0,281!0,19!0,470!0,568!0,601!0,2!1,251!0,7!2,140!0,285!0,11!2,33!0'
,@xml XML
SET @xml = CAST('<row><col>' + REPLACE(REPLACE(@str,'!','</col><col>'),',','</col></row><row><col>') + '</col></row>' AS XML)
SELECT
line.col.value('col[1]', 'varchar(1000)') AS col1
,line.col.value('col[2]', 'varchar(1000)') AS col2
FROM @xml.nodes('/row') AS line(col)
