There are scores of excellent posts here related to splitting a string in SQL.
I am looking however for a true string to table function.
In other words, taking a string that represents a table (with multiple rows and columns) and generating the equivalent table in SQL Server.
This means that there is an inner delimiter, and an outer delimiter.
The string needs to be split by the outer delimiter into rows, and then each row turned into a set of columns.
I have not been able to find anything posted here, so I thought I'd submit my version and solicit feedback.
In my particular usage scenarios, I know that there are no more than 10 columns per row, and that the total size is under varchar(8000).
Here's the tfSplitString function (listed just for completeness; you can use your own version as you like) followed by the dbo.tfStringToTable10Cols. Since I don't know the column names, they are simply output as Col1, Col2, Col3, etc.
if object_id('dbo.tfSplitString') is not null
exec (' DROP FUNCTION [dbo].[tfSplitString]');
go
CREATE FUNCTION [dbo].[tfSplitString]
/**********************************************************************************************************************
Purpose:
Split a given string at a given delimiter and return a list of the split elements (items).
Notes:
1. Leading and trailing delimiters are treated as if an empty string element were present.
2. Consecutive delimiters are treated as if an empty string element were present between them.
3. Except when spaces are used as a delimiter, all spaces present in each element are preserved.
select * from [dbo].[tfSplitString]('This is a test of the emergency broadcast system', ' ')
Returns:
1 This
2 is
3 a
4 test
5 of
6 the
7 emergency
8 broadcast
9 system
iTVF containing the following:
ItemNumber = Element position of Item as a BIGINT (not converted to INT to eliminate a CAST)
Item = Element value as a VARCHAR(8000)
CROSS APPLY Usage Examples and Tests:
--=====================================================================================================================
-- TEST 1:
-- This tests for various possible conditions in a string using a comma as the delimiter. The expected results are
-- laid out in the comments
--=====================================================================================================================
--===== Conditionally drop the test tables to make reruns easier for testing.
-- (this is NOT a part of the solution)
IF OBJECT_ID('tempdb..#JBMTest') IS NOT NULL DROP TABLE #JBMTest
;
--===== Create and populate a test table on the fly (this is NOT a part of the solution).
-- In the following comments, "b" is a blank and "E" is an element in the left to right order.
-- Double Quotes are used to encapsulate the output of "Item" so that you can see that all blanks
-- are preserved no matter where they may appear.
SELECT *
INTO #JBMTest
FROM ( --# & type of Return Row(s)
SELECT 0, NULL UNION ALL --1 NULL
SELECT 1, SPACE(0) UNION ALL --1 b (Empty String)
SELECT 2, SPACE(1) UNION ALL --1 b (1 space)
SELECT 3, SPACE(5) UNION ALL --1 b (5 spaces)
SELECT 4, ',' UNION ALL --2 b b (both are empty strings)
SELECT 5, '55555' UNION ALL --1 E
SELECT 6, ',55555' UNION ALL --2 b E
SELECT 7, ',55555,' UNION ALL --3 b E b
SELECT 8, '55555,' UNION ALL --2 E b
SELECT 9, '55555,1' UNION ALL --2 E E
SELECT 10, '1,55555' UNION ALL --2 E E
SELECT 11, '55555,4444,333,22,1' UNION ALL --5 E E E E E
SELECT 12, '55555,4444,,333,22,1' UNION ALL --6 E E b E E E
SELECT 13, ',55555,4444,,333,22,1,' UNION ALL --8 b E E b E E E b
SELECT 14, ',55555,4444,,,333,22,1,' UNION ALL --9 b E E b b E E E b
SELECT 15, ' 4444,55555 ' UNION ALL --2 E (w/Leading Space) E (w/Trailing Space)
SELECT 16, 'This,is,a,test.' --4 E E E E
) d (SomeID, SomeValue)
;
--===== Split the CSV column for the whole table using CROSS APPLY (this is the solution)
SELECT test.SomeID, test.SomeValue, split.ItemNumber, Item = QUOTENAME(split.Item,'"')
FROM #JBMTest test
CROSS APPLY dbo.tfSplitString(test.SomeValue,',') split
;
--=====================================================================================================================
-- TEST 2:
-- This tests for various "alpha" splits and COLLATION using all ASCII characters from 0 to 255 as a delimiter against
-- a given string. Note that not all of the delimiters will be visible and some will show up as tiny squares because
-- they are "control" characters. More specifically, this test will show you what happens to various non-accented
-- letters for your given collation depending on the delimiter you chose.
--=====================================================================================================================
WITH
cteBuildAllCharacters (String,Delimiter) AS
(
SELECT TOP 256
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
CHAR(ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1)
FROM master.sys.all_columns
)
SELECT ASCII_Value = ASCII(c.Delimiter), c.Delimiter, split.ItemNumber, Item = QUOTENAME(split.Item,'"')
FROM cteBuildAllCharacters c
CROSS APPLY dbo.tfSplitString(c.String,c.Delimiter) split
ORDER BY ASCII_Value, split.ItemNumber
;
--===== Define I/O parameters
*/
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
with
cte1(prevPos, nextPos) as ( -- previous position of the delimiter; next position of the delimiter
select 0, CHARINDEX(@pDelimiter, @pString)
union all
select nextPos, CHARINDEX(@pDelimiter, @pString, nextPos+1)
from cte1
where nextPos>0
)
select row_number() over (order by prevPos) ItemNumber -- number items by position so the numbering is deterministic
, SUBSTRING(@pString, prevPos+1, case when nextPos=0 then len(@pString) else nextPos-(prevPos+1) end) Item
from cte1;
GO
if object_id('dbo.tfStringToTable10Cols') is not null exec (' DROP FUNCTION [dbo].[tfStringToTable10Cols]');
go
Schema_drop_function 'tfStringToTable10Cols'
go
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
/**
Inlined Table-valued function to deserialize a string into an n by 10 table.
Column values must not contain either the inner or outer delimiter.
Empty rows are ignored.
Note: This could trivially be converted to return more columns if a use case arises
Example:
select * from dbo.tfStringToTable10Cols('a1,b1,c1,d1,e1,f1,g1,h1,i1,j1;;,b2,c2;a3,,c3,d3,e3,f3,g3,h3,i3,j3', DEFAULT, DEFAULT)
should return
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10
a1   b1   c1   d1   e1   f1   g1   h1   i1   j1
     b2   c2   NULL NULL NULL NULL NULL NULL NULL
a3        c3   d3   e3   f3   g3   h3   i3   j3
*/
create function dbo.tfStringToTable10Cols
(
@Series varchar(8000),
@OuterDelimiter char(1) = ';',
@InnerDelimiter char(1) = ','
)
RETURNS TABLE
AS
RETURN
select p.[1] Col1, p.[2] Col2, p.[3] Col3, p.[4] Col4, p.[5] Col5, p.[6] Col6, p.[7] Col7, p.[8] Col8, p.[9] Col9, p.[10] Col10
from dbo.tfSplitString(@Series, @OuterDelimiter) as r -- This breaks up the strings into serialized rowstrings
cross apply (
select * from dbo.tfSplitString(r.Item, @InnerDelimiter) cols -- This breaks up the rowstrings into columnNumber and columnVal rows
pivot (Max(cols.Item) -- The pivot then turns these rows into columns. The Max() is a no-op since each columnNumber is unique
for cols.ItemNumber in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10]))
as tmp) as P
where len(r.Item)>0
go
select * from dbo.tfStringToTable10Cols('a1,b1,c1,d1,e1,f1,g1,h1,i1,j1;,b2,c2;a3,,c3,d3,e3,f3,g3,h3,i3,j3', DEFAULT, DEFAULT)
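To make the pivot step easier to follow, here is a minimal sketch of the intermediate shape, assuming dbo.tfSplitString is installed as posted above. The first query shows the (row, column, value) triples produced by the nested split. The second does the same row-to-column fold with conditional aggregation instead of PIVOT, cut down to three columns for brevity. The sample string and the RowNumber/ColNumber/ColValue aliases are illustrative only.

```sql
-- Assumes dbo.tfSplitString from above exists; the sample data is illustrative only.
DECLARE @Series varchar(8000) = 'a1,b1,c1;a2,,c2;a3';

-- 1) Intermediate form: one row per (row number, column number, value).
SELECT RowNumber = r.ItemNumber,
       ColNumber = c.ItemNumber,
       ColValue  = c.Item
FROM dbo.tfSplitString(@Series, ';') AS r
CROSS APPLY dbo.tfSplitString(r.Item, ',') AS c
ORDER BY r.ItemNumber, c.ItemNumber;

-- 2) The same fold as the PIVOT, written as conditional aggregation (3 columns only).
SELECT Col1 = MAX(CASE WHEN c.ItemNumber = 1 THEN c.Item END),
       Col2 = MAX(CASE WHEN c.ItemNumber = 2 THEN c.Item END),
       Col3 = MAX(CASE WHEN c.ItemNumber = 3 THEN c.Item END)
FROM dbo.tfSplitString(@Series, ';') AS r
CROSS APPLY dbo.tfSplitString(r.Item, ',') AS c
WHERE LEN(r.Item) > 0          -- skip empty rows, as the function above does
GROUP BY r.ItemNumber
ORDER BY r.ItemNumber;
```

One thing to watch: because tfSplitString is built on a recursive CTE, a split that yields more than about 100 items should run into the default MAXRECURSION limit unless the calling statement adds OPTION (MAXRECURSION 0). That is fine for 10 columns but worth checking if a string holds many rows.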
Just for fun.
If by chance you are on 2016+, you could take advantage of the JSON capabilities, and step away from your tfSplitString function
Example
Declare @Series varchar(max) = 'a1,b1,c1,d1,e1,f1,g1,h1,i1,j1;;,b2,c2;a3,,c3,d3,e3,f3,g3,h3,i3,j3'
Declare @OuterDelimiter varchar(25) = ';'
Declare @InnerDelimiter varchar(25) = ','
Select B.*
From (
Select RetSeq = [Key]+1
,RetVal = Value
From OpenJSON( '["'+replace(replace(@Series,'"','\"'),@OuterDelimiter,'","')+'"]' )
) A
Cross Apply (
Select Pos1 = JSON_VALUE(S,'$[0]')
,Pos2 = JSON_VALUE(S,'$[1]')
,Pos3 = JSON_VALUE(S,'$[2]')
,Pos4 = JSON_VALUE(S,'$[3]')
,Pos5 = JSON_VALUE(S,'$[4]')
,Pos6 = JSON_VALUE(S,'$[5]')
,Pos7 = JSON_VALUE(S,'$[6]')
,Pos8 = JSON_VALUE(S,'$[7]')
,Pos9 = JSON_VALUE(S,'$[8]')
,Pos10 = JSON_VALUE(S,'$[9]')
From ( values ( '["'+replace(replace(A.RetVal,'"','\"'),@InnerDelimiter,'","')+'"]' ) ) A(S)
) B
Where A.RetVal <>''
Returns
Pos1 Pos2 Pos3 Pos4 Pos5 Pos6 Pos7 Pos8 Pos9 Pos10
a1   b1   c1   d1   e1   f1   g1   h1   i1   j1
     b2   c2   NULL NULL NULL NULL NULL NULL NULL
a3        c3   d3   e3   f3   g3   h3   i3   j3
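The JSON approach above escapes double quotes by hand; values containing a backslash or control characters would still produce invalid JSON. A possible variation, sketched here under the assumption that SQL Server 2016+ is available, lets STRING_ESCAPE do the escaping. The sample data, the three-column cut-down, and the RowNum/JsonRow names are illustrative only.

```sql
-- Sketch only: needs SQL Server 2016+ (OPENJSON also needs compatibility level 130 or higher).
DECLARE @Series varchar(max) = 'a1,say "hi",c:\temp;a2,b2';
DECLARE @OuterDelimiter char(1) = ';';
DECLARE @InnerDelimiter char(1) = ',';

SELECT RowNum = CAST(r.[key] AS int) + 1,        -- OPENJSON keys are zero-based strings
       Pos1   = JSON_VALUE(c.JsonRow, '$[0]'),
       Pos2   = JSON_VALUE(c.JsonRow, '$[1]'),
       Pos3   = JSON_VALUE(c.JsonRow, '$[2]')    -- NULL when the row has fewer columns
FROM OPENJSON('["' + REPLACE(STRING_ESCAPE(@Series, 'json'), @OuterDelimiter, '","') + '"]') AS r
CROSS APPLY (VALUES ('["' + REPLACE(STRING_ESCAPE(r.[value], 'json'), @InnerDelimiter, '","') + '"]')) AS c(JsonRow)
WHERE r.[value] <> ''                            -- skip empty rows
ORDER BY RowNum;
```

This only works when neither delimiter is a character that STRING_ESCAPE rewrites (a double quote, backslash, or forward slash, for example), since the REPLACE runs on the already-escaped text.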
Related
I have a data set with square brackets.
CREATE TABLE Testdata
(
SomeID INT,
String VARCHAR(MAX)
)
INSERT Testdata SELECT 1, 'S0000X-T859XX[DEFGH]'
INSERT Testdata SELECT 1, 'T880XX-T889XX[DS]'
INSERT Testdata SELECT 2, 'V0001X-Y048XX[DS]'
INSERT Testdata SELECT 2, 'Y0801X-Y0889X[AB]'
I need to get output like the following:
SomeId String
1 S0000XD-T859XXD
1 S0000XE-T859XXE
1 S0000XF-T859XXF
1 S0000XG-T859XXG
1 S0000XH-T859XXH
1 T880XXD-T889XXD
1 T880XXS-T889XXS
2 V0001XD-Y048XXD
2 V0001XS-Y048XXS
2 Y0801XA-Y0889XA
2 Y0801XB-Y0889XB
I would appreciate it if anyone could help with this.
You don't need a function here, and certainly no need for loops. A tally table will make short work of this. First you need a tally table. I keep one as a view on my system. It is nasty fast!!!
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
GO
You don't have to use a view like this; you could just use the CTEs directly in your code.
This is some rather ugly string manipulation but since your data is not properly normalized you are kind of stuck there. Please try this query. It produces what you said you expect as output.
select *
, NewOutput = left(td.String, charindex('-', td.String) - 1) + SUBSTRING(td.String, CHARINDEX('[', td.String) + t.N, 1) +
left(substring(td.String, charindex('-', td.String), len(td.String)), charindex('[', substring(td.String, charindex('-', td.String), len(td.String))) - 1) + SUBSTRING(td.String, CHARINDEX('[', td.String) + t.N, 1)
from TestData td
join cteTally t on t.N <= CHARINDEX(']', td.String) - CHARINDEX('[', td.String) - 1
order by td.String
, t.N
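The same string arithmetic can be easier to read when the positions and pieces are named once. The following is a sketch of that idea rather than a replacement for the query above: the sample rows are inlined, a 10-row VALUES list stands in for the tally view, and the pos/p/x aliases and their column names are made up for illustration.

```sql
-- Sketch only: inline sample rows; the VALUES list covers up to 10 bracketed characters.
SELECT td.SomeID,
       NewOutput = p.LeftPart + x.Ch + '-' + p.RightPart + x.Ch
FROM (VALUES (1, 'S0000X-T859XX[DEFGH]'),
             (2, 'Y0801X-Y0889X[AB]')) AS td(SomeID, String)
-- Locate the three marker characters once per row.
CROSS APPLY (SELECT DashPos  = CHARINDEX('-', td.String),
                    OpenPos  = CHARINDEX('[', td.String),
                    ClosePos = CHARINDEX(']', td.String)) AS pos
-- Cut out the code before the dash and the code between the dash and the bracket.
CROSS APPLY (SELECT LeftPart  = LEFT(td.String, pos.DashPos - 1),
                    RightPart = SUBSTRING(td.String, pos.DashPos + 1, pos.OpenPos - pos.DashPos - 1)) AS p
-- One output row per character inside the brackets.
JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) AS t(N)
  ON t.N <= pos.ClosePos - pos.OpenPos - 1
CROSS APPLY (SELECT Ch = SUBSTRING(td.String, pos.OpenPos + t.N, 1)) AS x
ORDER BY td.SomeID, td.String, t.N;
```

For the first sample row this yields S0000XD-T859XXD through S0000XH-T859XXH, matching the requested output format.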
I am posting this because I did it.
select distinct *
,[base]+substring(splitter,number,1)
from
(
select SomeID
-- split your column into a base plus a splitter column
,[base] = left(string,charindex('[',string)-1)
,splitter = substring(string, charindex('[',string)+1,len(string) - charindex('[',string)-1)
from
(
-- converted your insert into a union all
SELECT 1 SomeID, 'S0000X-T859XX[DEFGH]' string
union all
SELECT 1, 'T880XX-T889XX[DS]'
union all
SELECT 2, 'V0001X-Y048XX[DS]'
union all
SELECT 2, 'Y0801X-Y0889X[AB]'
) a
) inside
cross apply (Select number from master..spt_values where number>0 and number <=len(splitter)) b -- this is similar to a tally table using an internal SQL table
Since you didn't include any attempt to solve this, I will assume that you are not stuck on any particular technique, and are looking for a high-level approach.
I would solve this by first creating a Table-valued function that takes the id and the string as parameters, and creates an output table by looping through the characters in between the square-brackets, so that the function's output looks like this
ID String Character
1 S0000X-T859XX D
1 S0000X-T859XX E
1 S0000X-T859XX F
..
2 V0001X-Y048XX D
etc...
Then you simply write a query that joins the raw table to the function on the ID and the portion of the String without the brackets.
create table tbl1
(
name varchar(50)
);
insert into tbl1 values ('Mircrosoft SQL Server'),
('Office Microsoft');
create table tbl2
(
name varchar(50)
);
insert into tbl2 values ('SQL Server Microsoft'),
('Microsoft Office');
I want to get the percentage match between the name columns of the two tables.
I tried the Levenshtein algorithm, but the values in the two tables are the same apart from the word order, so I want such rows to show as a 100% match.
Tried: LEVENSHTEIN
SELECT [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) MatchedPercentage,a.name as tbl1_name,b.name as tbl2_name
FROM tbl1 a
CROSS JOIN tbl2 b
WHERE [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) >= 0;
Result:
MatchedPercentage tbl1_name tbl2_name
-----------------------------------------------------------------
5 Mircrosoft SQL Server SQL Server Microsoft
10 Office Microsoft SQL Server Microsoft
15 Mircrosoft SQL Server Microsoft Office
13 Office Microsoft Microsoft Office
As mentioned in the comments this can be achieved through the use of a string split table valued function. Personally I use one based on the very performant set-based tally table approach put together by Jeff Moden which is at the end of my answer.
Using this function allows you to compare the individual words as delimited by a space character and count up the number of matches compared to the total number of words in the two values.
Do note however that this solution falls over on any values with leading spaces. If this will be a problem, clean your data before running this script or adjust to handle them:
declare @t1 table(v nvarchar(50));
declare @t2 table(v nvarchar(50));
insert into @t1 values('Microsoft SQL Server'),('Office Microsoft'),('Other values'); -- Add in some extra values, with the same number of words and some with the same number of characters
insert into @t2 values('SQL Server Microsoft'),('Microsoft Office'),('that matched'),('that didn''t'),('Other valuee');
with c as
(
select t1.v as v1
,t2.v as v2
,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords -- String Length - String Length without spaces = Number of words - 1
from @t1 as t1
cross join @t2 as t2 -- Cross join the two tables to get all comparisons
where len(replace(t1.v,' ','')) = len(replace(t2.v,' ','')) -- Where the length without spaces is the same. Can't have the same words in a different order if the number of non space characters in the whole string is different
)
select c.v1
,c.v2
,c.NumWords
,sum(case when s1.item = s2.item then 1 else 0 end) as MatchedWords
from c
cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
,c.v2
,c.NumWords
having c.NumWords = sum(case when s1.item = s2.item then 1 else 0 end);
Output
+----------------------+----------------------+----------+--------------+
| v1 | v2 | NumWords | MatchedWords |
+----------------------+----------------------+----------+--------------+
| Microsoft SQL Server | SQL Server Microsoft | 3 | 3 |
| Office Microsoft | Microsoft Office | 2 | 2 |
+----------------------+----------------------+----------+--------------+
Function
create function dbo.fn_StringSplit4k
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
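Since the question asked for a percentage rather than an all-or-nothing match, here is a rough sketch of one way to get it for a single pair of values. It is not the approach above: it swaps in the built-in STRING_SPLIT (SQL Server 2016+) for the custom splitter and counts how many distinct words of the first string also occur in the second. The variables, sample values, and column aliases are assumptions for illustration.

```sql
-- Sketch only: STRING_SPLIT needs SQL Server 2016+ (compatibility level 130 or higher).
DECLARE @a varchar(50) = 'Microsoft SQL Server';
DECLARE @b varchar(50) = 'Server Office SQL';

SELECT WordsInA     = COUNT(*),
       MatchedWords = COUNT(w2.value),                       -- NULLs (no match) are not counted
       MatchedPct   = 100.0 * COUNT(w2.value) / NULLIF(COUNT(*), 0)
FROM (SELECT DISTINCT value FROM STRING_SPLIT(@a, ' ') WHERE value <> '') AS w1
LEFT JOIN (SELECT DISTINCT value FROM STRING_SPLIT(@b, ' ') WHERE value <> '') AS w2
       ON w2.value = w1.value;
```

With these sample values two of the three words match. Word order plays no part in the comparison, so 'Microsoft SQL Server' against 'SQL Server Microsoft' would come out as 100. Filtering out empty items also sidesteps the leading-space problem mentioned above.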
Given strings:
string 1: 'A,B,C,D,E'
string 2: 'X091,X089,X051,X043,X023'
Want to store into the temp table as:
String1 String2
---------------------
A X091
B X089
C X051
D X043
E X023
Tried: Created user defined function named udf_split and inserting into table for each column.
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1) SELECT Item FROM udf_split(@Str1,',');
INSERT INTO #TestString(string2) SELECT Item FROM udf_split(@Str2,',');
But getting following result:
SELECT * FROM #TestString
string1 string2
-----------------
A NULL
B NULL
C NULL
D NULL
E NULL
NULL X091
NULL X089
NULL X051
NULL X043
NULL X023
You need to insert both parts of each row together. Otherwise you end up with each row having one column containing null - which is exactly what happened in your attempt.
First, I would recommend not messing around with comma delimited strings in the database at all, if that's possible.
If you have control over the input, you'd better use table variables or XML.
If you don't, to split strings in any version before 2016, I would recommend first reading Aaron Bertrand's Split strings the right way – or the next best way. In 2016 you should use the built-in string_split function.
For this kind of thing you want to use a splitting function that returns both the item and its position in the original string. Luckily for you, Jeff Moden has already written such a splitting function, and it's a very popular, high-performance function.
You can read all about it on Tally OH! An Improved SQL 8K “CSV Splitter” Function.
So, here is Jeff's function:
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
and here is how you use it:
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1, string2)
SELECT A.Item, B.Item
FROM DelimitedSplit8K(@Str1,',') A
JOIN DelimitedSplit8K(@Str2,',') B
ON A.ItemNumber = B.ItemNumber;
My suggestion uses two independent recursive splits. This allows us to transform the two strings into two sets with a position index. The final SELECT joins the two sets on their position index and returns a sorted list:
DECLARE @str1 VARCHAR(1000)= 'A,B,C,D,E';
DECLARE @str2 VARCHAR(1000)= 'X091,X089,X051,X043,X023';
;WITH
--split the first string
a1 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str1, j+1) FROM a1 WHERE j > i)
,b1 AS (SELECT n, SUBSTRING(@str1, i+1, IIF(j>0, j, LEN(@str1)+1)-i-1) s FROM a1 WHERE i >= 0)
--split the second string
,a2 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str2, j+1) FROM a2 WHERE j > i)
,b2 AS (SELECT n, SUBSTRING(@str2, i+1, IIF(j>0, j, LEN(@str2)+1)-i-1) s FROM a2 WHERE i >= 0)
--join them by the index
SELECT b1.n
,b1.s AS s1
,b2.s AS s2
FROM b1
INNER JOIN b2 ON b1.n=b2.n
ORDER BY b1.n;
The result
n s1 s2
1 A X091
2 B X089
3 C X051
4 D X043
5 E X023
UPDATE: If you have v2016+...
With SQL-Server 2016+ you can use OPENJSON with a tiny string replacement to transform your CSV strings to a JSON-array:
SELECT a.[key]
,a.value AS s1
,b.value AS s2
FROM OPENJSON('["' + REPLACE(@str1,',','","') + '"]') a
INNER JOIN(SELECT * FROM OPENJSON('["' + REPLACE(@str2,',','","') + '"]')) b ON a.[key]=b.[key]
ORDER BY a.[key];
Unlike STRING_SPLIT(), this approach returns the key as the zero-based index of the element within the array. The result is the same...
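On newer versions there is a third option. As a sketch, assuming SQL Server 2022 or Azure SQL Database, STRING_SPLIT accepts an optional third argument (enable_ordinal) that adds a one-based ordinal column, so the two lists can be joined by position without JSON or a custom splitter. The variable names simply mirror the ones used above.

```sql
-- Sketch only: the enable_ordinal argument needs SQL Server 2022 or Azure SQL Database.
DECLARE @str1 varchar(1000) = 'A,B,C,D,E';
DECLARE @str2 varchar(1000) = 'X091,X089,X051,X043,X023';

SELECT a.ordinal,                 -- one-based position within the list
       s1 = a.value,
       s2 = b.value
FROM STRING_SPLIT(@str1, ',', 1) AS a
INNER JOIN STRING_SPLIT(@str2, ',', 1) AS b
        ON a.ordinal = b.ordinal
ORDER BY a.ordinal;
```

For the OPENJSON variant, note that [key] comes back as a string. With more than ten elements, sorting on CAST(a.[key] AS int) keeps the rows in list order, whereas sorting on the raw key would put '10' before '2'.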
I need to find common substring (without space) of two strings in SQL.
Query:
select *
from tbl as a, tbl as b
where a.str <> b.str
Sample data:
str1 | str2 | max substring without spaces
----------+-----------+-----------------------------
aabcdfbas | rikcdfva | cdf
aaab akuc | aaabir a | aaab
ab akuc | ab atr | ab
I disagree with those who say that SQL is not the tool for this job. Until someone can show me a faster way than my solution in ANY programming language, I will aver that SQL (written in a set-based, side-effect-free style, using only immutable variables) is the ONLY tool for this job (when dealing with varchar(8000) or nvarchar(4000)). The solution below is for varchar(8000).
1. A correctly indexed tally (numbers) table.
-- (1) build and populate a persisted (numbers) tally
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (n int not null);
WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N))
INSERT dbo.tally
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;
-- (2) Add Required constraints (and indexes) for performance
ALTER TABLE dbo.tally
ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;
ALTER TABLE dbo.tally
ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);
Note that a tally table function will not perform as well.
2. Using our tally table to return all possible substrings of a string
Using "abcd" as an example, let's get all of its substrings. Note my comments.
DECLARE @s1 varchar(8000) = 'abcd';
SELECT
position = t.N,
tokenSize = x.N,
string = substring(@s1, t.N, x.N)
FROM dbo.tally t -- token position
CROSS JOIN dbo.tally x -- token length
WHERE t.N <= len(@s1) -- all positions
AND x.N <= len(@s1) -- all lengths
AND len(@s1) - t.N - (x.N-1) >= 0 -- filter unnecessary rows [e.g.substring('abcd',3,3)]
This returns
position tokenSize string
----------- ----------- -------
1 1 a
2 1 b
3 1 c
4 1 d
1 2 ab
2 2 bc
3 2 cd
1 3 abc
2 3 bcd
1 4 abcd
3. dbo.getshortstring8K
What's this function about? The first major optimization. We're going to break the shorter of the two strings into every possible substring then see if it exists in the longer string. If you have two strings (S1 and S2) and S1 is longer than S2, we know that none of the substrings of S1 that are longer than S2 will be a substring of S2. That's the purpose of dbo.getshortstring8k: to ensure that we don't perform any unnecessary substring comparisons. This will make more sense in a moment.
This is hugely important because the number of substrings in a string can be calculated using a Triangle Number Function. With N as the length (number of characters) in a string, the number of substrings can be calculated as N*(N+1)/2. E.g. "abc" has 6 substrings: 3*(3+1)/2 = 6; a,b,c,ab,bc,abc. If we're comparing "abc" to "abcdefgh" we don't need to check if "abcd" is a substring of "abc".
Breaking "abcdefgh" (length=8) into all possible substrings requires 8*(8+1)/2 = 36 operations (vs 6 for "abc").
IF OBJECT_ID('dbo.getshortstring8k') IS NOT NULL DROP FUNCTION dbo.getshortstring8k;
GO
CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;
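As a quick sanity check of the N*(N+1)/2 figure mentioned above, the following sketch counts the (position, length) pairs directly and compares the count with the formula. It is illustrative only: a 10-row inline list stands in for dbo.tally, so the test string must be 10 characters or fewer, and the column aliases are made up.

```sql
-- Sketch only: the inline tally has 10 rows, so @s must be at most 10 characters long.
DECLARE @s varchar(8000) = 'abcd';

WITH tally(N) AS (
    SELECT N FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) AS v(N)
)
SELECT SubstringCount = COUNT(*),                         -- pairs of (start position, length)
       TriangleNumber = LEN(@s) * (LEN(@s) + 1) / 2       -- closed-form count
FROM tally AS t                                           -- start position
CROSS JOIN tally AS x                                     -- substring length
WHERE t.N <= LEN(@s)
  AND x.N <= LEN(@s) - t.N + 1;                           -- substring must fit inside the string
```

For 'abcd' both columns come out as 10, which matches the ten rows listed in step 2.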
4. Finding all substrings of the shorter string that exist in the longer string:
DECLARE @s1 varchar(8000) = 'bcdabc', @s2 varchar(8000) = 'abcd';
SELECT
s.s1, -- test to make sure s.s1 is the shorter of the two strings
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s --<< get the shorter string
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0;
5. Retrieving ONLY the longest common substring(s)
This is the easy part. We simply add TOP (1) WITH TIES to our SELECT statement and we're all set. Here, the longest common substrings are "bc" and "xx".
DECLARE @s1 varchar(8000) = 'xxabcxx', @s2 varchar(8000) = 'bcdxx';
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
ORDER BY x.N DESC;
6. Applying this logic to your table
Using APPLY we replace my variables @s1 and @s2 with t.str1 and t.str2. I add a filter to exclude matches that contain spaces (see my comments)... And we're off:
-- easily consumable sample data
DECLARE @yourtable TABLE (str1 varchar(8000), str2 varchar(8000));
INSERT @yourtable
VALUES ('aabcdfbas','rikcdfva'),('aaab akuc','aaabir a'),('ab akuc','ab atr');
SELECT str1, str2, [max substring without spaces] = string
FROM @yourtable t
CROSS APPLY
(
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(t.str1, t.str2) s -- @s1 & @s2 replaced with str1 & str2
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
AND charindex(' ',substring(s.s1, t.N, x.N)) = 0 -- exclude substrings with spaces
ORDER BY x.N DESC
) lcss;
Results:
str1 str2 max substring without spaces
----------- --------- ------------------------------
aabcdfbas rikcdfva cdf
aaab akuc aaabir a aaab
ab akuc ab atr ab
And the execution plan:
... No sorts or unnecessary operations. Just speed. For longer strings (e.g. 50 characters+) I have an even faster technique you can read about here.
I'm trying to write a select statement which returns the value if it doesn't have at least 3 of the declared characters, but I can't think of how to get it working. Can someone point me in the right direction?
One thing to consider: I am not allowed to create a temporary table for this exercise.
I haven't really got any SQL so far, as I can't think of a way to do it without a temp table.
The declared characters are any alpha characters between a and z, so if the value in the db is '1873' then it would return the value because it doesn't have at least 3 of the declared characters, but if the value was 'abcdefg' then it would not be returned as it has at least 3 of the declared characters.
Is anyone able to point me in a starting direction for this?
This will find all sys.objects with an x or a z:
Some explanations, as this is an exercise and you want to learn something:
You can split a delimited string by transforming it into XML. x,z comes out as <x>x</x><x>z</x>. You can use this to create a derived table.
I use a CTE to avoid a created or declared table...
You can use CROSS APPLY for row-wise actions. Here I use CHARINDEX to find the position(s) of the chars you are looking for.
If none of them is found, their SUM is zero. I use GROUP BY and HAVING to check this.
Hope this is clear :-)
DECLARE @chars VARCHAR(100)='x,z';
WITH Splitted AS
(
SELECT A.B.value('.','char') AS TheChar
FROM
(
SELECT CAST('<x>' + REPLACE(@chars,',','</x><x>')+ '</x>' AS XML) AS AsXml
) AS tbl
CROSS APPLY AsXml.nodes('/x') AS A(B)
)
SELECT name
FROM sys.objects
CROSS APPLY (SELECT CHARINDEX(TheChar,name) AS Found FROM Splitted) AS Found
GROUP BY name,Found
HAVING SUM(Found)>0
With
SrcTab As (
Select *
From (values ('Contains x y z')
, ('Contains x and y')
, ('Contains y only')) v (SrcField)),
CharList As ( --< CTE instead of temporary table
Select *
From (values ('x')
, ('y')
, ('z')) v (c))
Select SrcField
From SrcTab, CharList
Group By SrcField
Having SUM(SIGN(CharIndex(C, SrcField))) < 3 --< Count hits
;
If Distinct is not desirable and we need to only check count for each row:
With
SrcTab As ( --< Sample Data CTE
Select *
From (values ('Contains x y z')
, ('Contains x and y')
, ('Contains y only')
, ('Contains y only')) v (SrcField))
Select SrcField
From SrcTab
Where (
Select Count(*) --< Count hits
From (Values ('x'), ('y'), ('z')) v (c)
Where CharIndex(C, SrcField) > 0
) < 3
;
Using a Numbers table and joins. I used only 3 declared characters for demo purposes.
Input:
12345
abcdef
ab
Declared table:used only 3 for demo..
a
b
c
Output:
12345
ab
Demo:
---Table population Scripts
Create table #t
(
val varchar(20)
)
insert into #t
select '12345'
union all
select 'abcdef'
union all
select 'ab'
create table #declarecharacters
(
dc char(1)
)
insert into #declarecharacters
select 'a'
union all
select 'b'
union all
select 'c'
Query used
;with cte
as
(
select * from #t
cross apply
(
select substring(val,n,1) as strr from numbers where n<=len(val))b(outputt)
)
select val from
cte c
left join
#declarecharacters dc1
on
dc1.dc=c.outputt
group by val
having
sum(case when dc is null then 0 else 1 end ) <3
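If neither a temp table nor a persisted numbers table is allowed, the same counting idea can be sketched entirely with CTEs. This is a rough sketch under stated assumptions: "declared characters" is read as the letters a to z, each occurrence counts (so 'aaa' counts as three), values are at most 100 characters long, and the SrcTab sample rows are made up for illustration.

```sql
-- Sketch only: inline sample values; letters are counted per occurrence.
WITH SrcTab(val) AS (
    SELECT val FROM (VALUES ('1873'), ('abcdefg'), ('a1b2'), ('x9y8z7')) AS v(val)
),
Tally(N) AS (   -- 100 sequential numbers, enough for values up to 100 characters
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS a(x)
    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS b(x)
)
SELECT s.val
FROM SrcTab AS s
WHERE (SELECT COUNT(*)                                   -- letters found in this value
       FROM Tally AS t
       WHERE t.N <= LEN(s.val)
         AND SUBSTRING(s.val, t.N, 1) LIKE '[a-z]') < 3; -- keep values with fewer than 3 letters
```

With the sample rows this returns '1873' and 'a1b2'. What the [a-z] range matches depends on the collation: under a typical case-insensitive collation it also matches uppercase letters, and it may match accented letters as well.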