Common substring of two strings in SQL - sql-server

I need to find the longest common substring (without spaces) of two strings in SQL.
Query:
select *
from tbl as a, tbl as b
where a.str <> b.str
Sample data:
str1 | str2 | max substring without spaces
----------+-----------+-----------------------------
aabcdfbas | rikcdfva | cdf
aaab akuc | aaabir a | aaab
ab akuc | ab atr | ab

I disagree with those who say that SQL is not the tool for this job. Until someone can show me a faster way than my solution in ANY programming language, I will maintain that set-based, side-effect-free SQL using only immutable values is the ONLY tool for this job (when dealing with varchar(8000) or nvarchar(4000) strings). The solution below is for varchar(8000).
1. A correctly indexed tally (numbers) table.
-- (1) build and populate a persisted (numbers) tally
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (n int not null);
WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N))
INSERT dbo.tally
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;
-- (2) Add Required constraints (and indexes) for performance
ALTER TABLE dbo.tally
ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;
ALTER TABLE dbo.tally
ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);
Note that a tally table function will not perform as well.
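For readers who cannot create a persisted table, an inline tally function looks roughly like this (a sketch of the kind of function being referred to; the name fnTally8k is illustrative and not part of this answer):
-- Minimal sketch of an inline tally function; it generates 1..@rows on the fly
-- instead of reading dbo.tally (10,000 rows max, enough for varchar(8000)).
CREATE FUNCTION dbo.fnTally8k (@rows int)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
     E4(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c CROSS JOIN E1 d)
SELECT TOP (ISNULL(@rows,0)) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4;
-- Usage: SELECT N FROM dbo.fnTally8k(8000);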
2. Using our tally table to return all possible substrings of a string
Using "abcd" as an example, let's get all of its substrings. Note my comments.
DECLARE @s1 varchar(8000) = 'abcd';
SELECT
position = t.N,
tokenSize = x.N,
string = substring(@s1, t.N, x.N)
FROM dbo.tally t -- token position
CROSS JOIN dbo.tally x -- token length
WHERE t.N <= len(@s1) -- all positions
AND x.N <= len(@s1) -- all lengths
AND len(@s1) - t.N - (x.N-1) >= 0 -- filter unnecessary rows [e.g. substring('abcd',3,2)]
This returns
position tokenSize string
----------- ----------- -------
1 1 a
2 1 b
3 1 c
4 1 d
1 2 ab
2 2 bc
3 2 cd
1 3 abc
2 3 bcd
1 4 abcd
3. dbo.getshortstring8K
What's this function about? The first major optimization. We're going to break the shorter of the two strings into every possible substring, then see if each one exists in the longer string. If you have two strings (S1 and S2) and S1 is longer than S2, we know that none of the substrings of S1 that are longer than S2 can be a substring of S2. That's the purpose of dbo.getshortstring8k: to ensure that we don't perform any unnecessary substring comparisons. This will make more sense in a moment.
This is hugely important because the number of substrings in a string can be calculated using a Triangle Number Function. With N as the length (number of characters) of a string, the number of substrings is N*(N+1)/2. E.g. "abc" has 6 substrings: 3*(3+1)/2 = 6; a,b,c,ab,bc,abc. If we're comparing "abc" to "abcdefgh" we don't need to check if "abcd" is a substring of "abc".
Breaking "abcdefgh" (length=8) into all possible substrings requires 8*(8+1)/2 = 36 operations (vs 6 for "abc").
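Just to make the triangle-number math concrete, here is the substring count for a few lengths (plain arithmetic, nothing more):
SELECT stringLength = n, substringCount = n*(n+1)/2
FROM (VALUES (3),(8),(100),(8000)) AS v(n);
-- 3 -> 6, 8 -> 36, 100 -> 5,050, 8000 -> 32,004,000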
IF OBJECT_ID('dbo.getshortstring8k') IS NOT NULL DROP FUNCTION dbo.getshortstring8k;
GO
CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;
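A quick sanity check of the function (assuming it was created as above):
DECLARE @a varchar(8000) = 'abcdefgh', @b varchar(8000) = 'abc';
SELECT s.s1, s.s2 -- returns s1 = 'abc' (the shorter), s2 = 'abcdefgh' (the longer)
FROM dbo.getshortstring8k(@a, @b) AS s;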
4. Finding all substrings of the shorter string that exist in the longer string:
DECLARE @s1 varchar(8000) = 'bcdabc', @s2 varchar(8000) = 'abcd';
SELECT
s.s1, -- test to make sure s.s1 is the shorter of the two strings
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s --<< get the shorter string
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0;
5. Retrieving ONLY the longest common substring(s)
This is the easy part. We simply add TOP (1) WITH TIES to our SELECT statement and we're all set. Here, the longest common substrings are "bc" and "xx".
DECLARE @s1 varchar(8000) = 'xxabcxx', @s2 varchar(8000) = 'bcdxx';
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
ORDER BY x.N DESC;
6. Applying this logic to your table
Using APPLY we replace the variables @s1 and @s2 with the str1 and str2 columns (note that I alias the source table as yt so it doesn't collide with the tally alias t). I add a filter to exclude matches that contain spaces (see my comments)... And we're off:
-- easily consumable sample data
DECLARE @yourtable TABLE (str1 varchar(8000), str2 varchar(8000));
INSERT @yourtable
VALUES ('aabcdfbas','rikcdfva'),('aaab akuc','aaabir a'),('ab akuc','ab atr');
SELECT str1, str2, [max substring without spaces] = string
FROM @yourtable yt
CROSS APPLY
(
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(yt.str1, yt.str2) s -- @s1 & @s2 replaced with str1 & str2
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
AND charindex(' ',substring(s.s1, t.N, x.N)) = 0 -- exclude substrings with spaces
ORDER BY x.N DESC
) lcss;
Results:
str1 str2 max substring without spaces
----------- --------- ------------------------------
aabcdfbas rikcdfva cdf
aaab akuc aaabir a aaab
ab akuc ab atr ab
And the execution plan:
No sorts or unnecessary operations. Just speed. For longer strings (e.g. 50+ characters) I have an even faster technique you can read about here.


Matching string with LEVENSHTEIN algorithm

create table tbl1
(
name varchar(50)
);
insert into tbl1 values ('Mircrosoft SQL Server'),
('Office Microsoft');
create table tbl2
(
name varchar(50)
);
insert into tbl2 values ('SQL Server Microsoft'),
('Microsoft Office');
I want to get the percentage of matching strings between the two tables' name columns.
I tried the LEVENSHTEIN algorithm. But what I want is that when the data is the same between the tables, only in a different word order, the output shows it as 100% matching.
Tried: LEVENSHTEIN
SELECT [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) MatchedPercentage,a.name as tbl1_name,b.name as tbl2_name
FROM tbl1 a
CROSS JOIN tbl2 b
WHERE [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) >= 0;
Result:
MatchedPercentage tbl1_name tbl2_name
-----------------------------------------------------------------
5 Mircrosoft SQL Server SQL Server Microsoft
10 Office Microsoft SQL Server Microsoft
15 Mircrosoft SQL Server Microsoft Office
13 Office Microsoft Microsoft Office
As mentioned in the comments this can be achieved through the use of a string split table valued function. Personally I use one based on the very performant set-based tally table approach put together by Jeff Moden which is at the end of my answer.
Using this function allows you to compare the individual words as delimited by a space character and count up the number of matches compared to the total number of words in the two values.
Do note however that this solution falls over on any values with leading spaces. If this will be a problem, clean your data before running this script or adjust to handle them:
declare @t1 table(v nvarchar(50));
declare @t2 table(v nvarchar(50));
insert into @t1 values('Microsoft SQL Server'),('Office Microsoft'),('Other values'); -- Add in some extra values, with the same number of words and some with the same number of characters
insert into @t2 values('SQL Server Microsoft'),('Microsoft Office'),('that matched'),('that didn''t'),('Other valuee');
with c as
(
select t1.v as v1
,t2.v as v2
,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords -- String Length - String Length without spaces = Number of words - 1
from @t1 as t1
cross join @t2 as t2 -- Cross join the two tables to get all comparisons
where len(replace(t1.v,' ','')) = len(replace(t2.v,' ','')) -- Where the length without spaces is the same. Can't have the same words in a different order if the number of non space characters in the whole string is different
)
select c.v1
,c.v2
,c.NumWords
,sum(case when s1.item = s2.item then 1 else 0 end) as MatchedWords
from c
cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
,c.v2
,c.NumWords
having c.NumWords = sum(case when s1.item = s2.item then 1 else 0 end);
Output
+----------------------+----------------------+----------+--------------+
| v1 | v2 | NumWords | MatchedWords |
+----------------------+----------------------+----------+--------------+
| Microsoft SQL Server | SQL Server Microsoft | 3 | 3 |
| Office Microsoft | Microsoft Office | 2 | 2 |
+----------------------+----------------------+----------+--------------+
Function
create function dbo.fn_StringSplit4k
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
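For reference, a standalone call to the splitter looks like this (the third parameter returns every item when NULL):
SELECT s.rn, s.item
FROM dbo.fn_StringSplit4k(N'Microsoft SQL Server', N' ', NULL) AS s;
-- rn  item
-- 1   Microsoft
-- 2   SQL
-- 3   Server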

Store multiple comma separated strings into temp table

Given strings:
string 1: 'A,B,C,D,E'
string 2: 'X091,X089,X051,X043,X023'
Want to store into the temp table as:
String1 String2
---------------------
A X091
B X089
C X051
D X043
E X023
Tried: Created user defined function named udf_split and inserting into table for each column.
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1) SELECT Item FROM udf_split(@Str1,',');
INSERT INTO #TestString(string2) SELECT Item FROM udf_split(@Str2,',');
But getting following result:
SELECT * FROM #TestString
string1 string2
-----------------
A NULL
B NULL
C NULL
D NULL
E NULL
NULL X091
NULL X089
NULL X051
NULL X043
NULL X023
You need to insert both parts of each row together. Otherwise you end up with each row having one column containing null - which is exactly what happened in your attempt.
First, I would recommend not messing around with comma delimited strings in the database at all, if that's possible.
If you have control over the input, you are better off using table variables or XML.
If you don't, then to split strings in any version below 2016 I would recommend first reading Aaron Bertrand's Split strings the right way – or the next best way. In 2016 and later you can use the built-in STRING_SPLIT function (though it does not report each item's position until SQL Server 2022).
For this kind of thing you want to use a splitting function that returns both the item and its location in the original string. Lucky for you, Jeff Moden has already written such a splitting function, and it's a very popular, high-performance function.
You can read all about it on Tally OH! An Improved SQL 8K “CSV Splitter” Function.
So, here is Jeff's function:
CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
and here is how you use it:
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E'
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023'
IF OBJECT_ID('tempdb..#TestString') IS NOT NULL DROP TABLE #TestString;
CREATE TABLE #TestString (string1 varchar(100),string2 varchar(100));
INSERT INTO #TestString(string1, string2)
SELECT A.Item, B.Item
FROM dbo.DelimitedSplit8K(@Str1,',') A
JOIN dbo.DelimitedSplit8K(@Str2,',') B
ON A.ItemNumber = B.ItemNumber;
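As a side note: if you are on SQL Server 2022 (or Azure SQL Database), STRING_SPLIT accepts a third enable_ordinal argument, so the same idea can be expressed without a custom splitter (a sketch; not available in older versions):
DECLARE @Str1 VARCHAR(MAX) = 'A,B,C,D,E';
DECLARE @Str2 VARCHAR(MAX) = 'X091,X089,X051,X043,X023';
SELECT string1 = A.value, string2 = B.value
FROM STRING_SPLIT(@Str1, ',', 1) AS A    -- 1 = enable_ordinal
JOIN STRING_SPLIT(@Str2, ',', 1) AS B
  ON A.ordinal = B.ordinal;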
My suggestion uses two independent recursive splits. This allows us to transform the two strings into two sets with a position index. The final SELECT joins the two sets on their position index and returns a sorted list:
DECLARE @str1 VARCHAR(1000)= 'A,B,C,D,E';
DECLARE @str2 VARCHAR(1000)= 'X091,X089,X051,X043,X023';
;WITH
--split the first string
a1 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str1, j+1) FROM a1 WHERE j > i)
,b1 AS (SELECT n, SUBSTRING(@str1, i+1, IIF(j>0, j, LEN(@str1)+1)-i-1) s FROM a1 WHERE i >= 0)
--split the second string
,a2 AS (SELECT n=0, i=-1, j=0 UNION ALL SELECT n+1, j, CHARINDEX(',', @str2, j+1) FROM a2 WHERE j > i)
,b2 AS (SELECT n, SUBSTRING(@str2, i+1, IIF(j>0, j, LEN(@str2)+1)-i-1) s FROM a2 WHERE i >= 0)
--join them by the index
SELECT b1.n
,b1.s AS s1
,b2.s AS s2
FROM b1
INNER JOIN b2 ON b1.n=b2.n
ORDER BY b1.n;
The result
n s1 s2
1 A X091
2 B X089
3 C X051
4 D X043
5 E X023
UPDATE: If you have v2016+...
With SQL-Server 2016+ you can use OPENJSON with a tiny string replacement to transform your CSV strings to a JSON-array:
SELECT a.[key]
,a.value AS s1
,b.value AS s2
FROM OPENJSON('["' + REPLACE(@str1,',','","') + '"]') a
INNER JOIN(SELECT * FROM OPENJSON('["' + REPLACE(@str2,',','","') + '"]')) b ON a.[key]=b.[key]
ORDER BY a.[key];
Unlike STRING_SPLIT(), this approach returns the key as the zero-based index of the element within the array. Otherwise the result is the same...

SQL, finding discrete overlapping time intervals of multiple ranges against one "key" interval & calculate "most restrictive" common overlaps?

I'm working in SQL Server Management Studio so I suppose this is a Microsoft SQL Server T-SQL question.
The real world scenario is this: I have multiple employees working across multiple locations and the "time in" and "time out" records for each. I have already created a unique "Shift ID" for each set of time intervals and joined, based on employee, date, and location, the shifts of other employees that match my keystone employee, i.e. the one against which I am comparing everyone else.
Furthermore, I have written a query that pulled each "other employee's" specific overlapping time interval with the keystone. For one shift the timeline looks like this:
Key Emp. | 9AM------------------------6PM
Emp. A | 9AM------------------------6PM
Emp. B | 12PM-------4PM
So the discrete periods where a true "controlled" comparison can be made are among:
Key, A, and B from 12PM - 4PM
Key and A from 9AM - 12PM
Key and A from 4PM - 6PM
The end goal is to pull all the activity (organized as events with datetime stamps in a separate table) for each employee that occurs within those time periods and compare totals for each relevant employee. So there would be a separate "Count(events)" total for each time frame only affected by the employees that share the time interval as described above.
Currently, my data is organized like this:
the "In" and "Out" columns for key and other employees are stored as TIMESTAMPs; the "1/1,6PM" is just my crappy way of saving space in my example. Please see my consumable data at the end of this post. SSMS doesn't seem to care that I have more than one TIMESTAMP column and treats them all like DATETIME:
Key_ShiftID| Key In | Key Out | Oth_Emp_ShiftID | Oth_Emp_In | Oth_Emp_Out
K 1/1,9AM 1/1,6PM A 1/1,9AM 1/1,6PM
K 1/1,9AM 1/1,6PM B 1/1,12PM 1/1,4PM
Where the Shift IDs (Key_ShiftID and Oth_Emp_ShiftID) are unique strings and the time intervals are defined by two columns apiece (Key_In & Key_Out, Oth_Emp_In & Oth_Emp_Out) stored as datetime/timestamps. I'm looking for discrete periods where I can compare the activity of the employees, which is in a separate table with each event having a unique datetime, as was mentioned earlier. Thus, I think the ending data would look something like this:
Key_ShiftID| Key_In | Key_Out | Oth_Emp_ShiftID | Oth_Emp_In | Oth_Emp_Out
K 1/1,9AM 1/1,6PM A 1/1,12PM 1/1,4PM
K 1/1,9AM 1/1,6PM B 1/1,12PM 1/1,4PM
K 1/1,9AM 1/1,6PM A 1/1,9AM 1/1,12PM
K 1/1,9AM 1/1,6PM A 1/1,4PM 1/1,6PM
So I would be able to join the table above to my activity table by ShiftID and bring in the Count(events) per relevant employee
where event_datetime >= Oth_Emp_In and event_datetime <= Oth_Emp_Out
Additionally, as I noted before, I already wrote a query to cut down the non-key employees' shifts to reflect only the time intervals where they overlap with the key employee, so the Other_Emp_In will always be greater than or equal to the Key In time and Other_Emp_Out will always be less than or equal to the Key Out time.
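For reference, the final counting step I have in mind would look roughly like this (Activity and discrete_periods are stand-in names for my events table and for the desired result set shown above):
SELECT d.Key_ShiftID, d.Oth_Emp_ShiftID, d.Oth_Emp_In, d.Oth_Emp_Out,
       event_count = COUNT(a.event_datetime)
FROM discrete_periods AS d
LEFT JOIN Activity AS a
  ON  a.ShiftID = d.Oth_Emp_ShiftID
  AND a.event_datetime >= d.Oth_Emp_In
  AND a.event_datetime <= d.Oth_Emp_Out
GROUP BY d.Key_ShiftID, d.Oth_Emp_ShiftID, d.Oth_Emp_In, d.Oth_Emp_Out;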
Thanks in advance. I've been researching and struggling with this for around 2 days.
Here's sample data of one key shift (not the exact example above):
Also, SQL Server doesn't seem to care that I have more than one TIMESTAMP column and treats them all like DATETIME.
CREATE TABLE "sample_data"
(
"Employee" INT,
"Key_ShiftID" TEXT,
"Key_In" TIMESTAMP,
"Key_Out" TIMESTAMP,
"Other_Emp_ShiftID" TEXT,
"Other_Emp_In" TIMESTAMP,
"Other_Emp_Out" TIMESTAMP,
"overlap_min" TIMESTAMP,
"overlap_max" TIMESTAMP
);
INSERT INTO "sample_data"
VALUES (900, '545BD826-0C9A-408B-BE9F-4C3D7D307948', '2016-09-27 14:15:00', '2016-09-27 21:45:00', '035FA1C1-B469-44EB-B5B4-5B6948574464', '2016-09-27 08:45:00', '2016-09-27 16:15:00', '2016-09-27 14:15:00', '2016-09-27 16:15:00'),
(78, '545BD826-0C9A-408B-BE9F-4C3D7D307948', '2016-09-27 14:15:00', '2016-09-27 21:45:00', '74035838-FD07-4F8D-8AC4-F6407AC786D9', '2016-09-27 18:00:00', '2016-09-27 21:15:00', '2016-09-27 18:00:00', '2016-09-27 21:15:00'),
(900, '545BD826-0C9A-408B-BE9F-4C3D7D307948', '2016-09-27 14:15:00', '2016-09-27 21:45:00', 'D7E9ADCD-8631-476D-B69F-00626F0E4B06', '2016-09-27 16:45:00', '2016-09-27 21:45:00', '2016-09-27 16:45:00', '2016-09-27 21:45:00');
Welcome to StackOverflow. In the future, try to include some easily consumable sample data like what I am including in my solution below.
This is a fun little problem. For this kind of thing I leverage my patExtract8K function which leverages ngrams8K. Here's an example of how to use PatExtract; here I'm extracting money from a string:
SELECT p.*
FROM dbo.patextract8K('Pay me $50.17 now or $1000 later!','[^$0-9.]') AS p;
Results:
itemNumber itemIndex itemLength item
----------- ---------- ----------- --------
1 8 6 $50.17
2 22 5 $1000
Now to tackle your problem:
-- Easily consumable sample data
DECLARE @table TABLE (shiftId VARCHAR(2), empKey VARCHAR(5), workDuration VARCHAR(100));
INSERT @table(shiftId,empKey,workDuration)
VALUES
('K','A','12PM - 4PM'),
('K','B','12PM - 4PM'),
('K','A','9AM - 12PM'),
('K','A','4PM - 6PM');
-- Solution
SELECT
shiftId = f.shiftId,
KeyIn = '1/1,'+REPLACE(CONVERT(VARCHAR(10),
MIN(CAST(f.c1 AS TIME)) OVER (),100),':00',''),
KeyOut = '1/1,'+REPLACE(CONVERT(VARCHAR(10),
MAX(CAST(f.c2 AS TIME)) OVER (),100),':00',''),
empShift = f.empKey,
othEmpIn = '1/1,'+f.c1,
othEmpOut = '1/1,'+f.c2
FROM
(
SELECT t.shiftId, t.empKey, t.workDuration,
c1 = MAX(CASE p.itemNumber WHEN 1 THEN p.item END),
c2 = MAX(CASE p.itemNumber WHEN 2 THEN p.item END)
FROM @table AS t
CROSS APPLY dbo.patExtract8k(t.workDuration, '[^0-9APM]') AS p
CROSS APPLY (VALUES(CAST(p.item AS TIME))) AS tm(N)
GROUP BY t.shiftId, t.empKey, t.workDuration
) AS f;
Results:
shiftId KeyIn KeyOut empShift othEmpIn othEmpOut
------- ---------- ------------ -------- ------------ ------------
K 1/1,9AM 1/1,6PM A 1/1,12PM 1/1,4PM
K 1/1,9AM 1/1,6PM A 1/1,4PM 1/1,6PM
K 1/1,9AM 1/1,6PM A 1/1,9AM 1/1,12PM
K 1/1,9AM 1/1,6PM B 1/1,12PM 1/1,4PM
Note that I have no idea where the "1/1" is coming from so I just hard-coded that in.
Here are my underlying functions. All are very helpful for solving a wide array of SQL problems efficiently and with little code.
CREATE FUNCTION dbo.rangeAB
(
@low bigint,
@high bigint,
@gap bigint,
@row1 bit
)
/****************************************************************************************
[Purpose]:
Creates up to 531,441,000,000 sequentia1 integers numbers beginning with #low and ending
with #high. Used to replace iterative methods such as loops, cursors and recursive CTEs
to solve SQL problems. Based on Itzik Ben-Gan's getnums function with some tweeks and
enhancements and added functionality. The logic for getting rn to begin at 0 or 1 is
based comes from Jeff Moden's fnTally function.
The name range because it's similar to clojure's range function. The name "rangeAB" as
used because "range" is a reserved SQL keyword.
[Author]: Alan Burstein
[Compatibility]:
SQL Server 2008+ and Azure SQL Database
[Syntax]:
SELECT r.RN, r.OP, r.N1, r.N2
FROM dbo.rangeAB(@low,@high,@gap,@row1) AS r;
[Parameters]:
@low = a bigint that represents the lowest value for n1.
@high = a bigint that represents the highest value for n1.
@gap = a bigint that represents how much n1 and n2 will increase each row; @gap also
represents the difference between n1 and n2.
@row1 = a bit that represents the first value of rn. When @row1 = 0 then rn begins
at 0, when @row1 = 1 then rn will begin at 1.
[Returns]:
Inline Table Valued Function returns:
rn = bigint; a row number that works just like T-SQL ROW_NUMBER() except that it can
start at 0 or 1, which is dictated by @row1.
op = bigint; returns the "opposite number" that relates to rn. When rn begins with 0 and
ends with 10 then 10 is the opposite of 0, 9 the opposite of 1, etc. When rn begins
with 1 and ends with 5 then 1 is the opposite of 5, 2 the opposite of 4, etc...
n1 = bigint; a sequential number starting at the value of @low and incrementing by the
value of @gap while it is less than or equal to the value of @high.
n2 = bigint; a sequential number starting at the value of @low+@gap and incrementing
by the value of @gap.
[Dependencies]:
N/A
[Developer Notes]:
1. The lowest and highest possible numbers returned are whatever is allowable by a
bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3).
2. #gap does not affect rn, rn will begin at #row1 and increase by 1 until the last row
unless its used in a query where a filter is applied to rn.
3. #gap must be greater than 0 or the function will not return any rows.
4. Keep in mind that when #row1 is 0 then the highest row-number will be the number of
rows returned minus 1
5. If you only need is a sequential set beginning at 0 or 1 then, for best performance
use the RN column. Use N1 and/or N2 when you need to begin your sequence at any
number other than 0 or 1 or if you need a gap between your sequence of numbers.
6. Although #gap is a bigint it must be a positive integer or the function will
not return any rows.
7. The function will not return any rows when one of the following conditions are true:
* any of the input parameters are NULL
* #high is less than #low
* #gap is not greater than 0
To force the function to return all NULLs instead of not returning anything you can
add the following code to the end of the query:
UNION ALL
SELECT NULL, NULL, NULL, NULL
WHERE NOT (#high&#low&#gap&#row1 IS NOT NULL AND #high >= #low AND #gap > 0)
This code was excluded as it adds a ~5% performance penalty.
8. There is no performance penalty for sorting by rn ASC; there is a large performance
penalty for sorting in descending order WHEN #row1 = 1; WHEN #row1 = 0
If you need a descending sort the use op in place of rn then sort by rn ASC.
Best Practices:
--===== 1. Using RN (rownumber)
-- (1.1) The best way to get the numbers 1,2,3...#high (e.g. 1 to 5):
SELECT RN FROM dbo.rangeAB(1,5,1,1);
-- (1.2) The best way to get the numbers 0,1,2...#high-1 (e.g. 0 to 5):
SELECT RN FROM dbo.rangeAB(0,5,1,0);
--===== 2. Using OP for descending sorts without a performance penalty
-- (2.1) The best way to get the numbers 5,4,3...#high (e.g. 5 to 1):
SELECT op FROM dbo.rangeAB(1,5,1,1) ORDER BY rn ASC;
-- (2.2) The best way to get the numbers 0,1,2...#high-1 (e.g. 5 to 0):
SELECT op FROM dbo.rangeAB(1,6,1,0) ORDER BY rn ASC;
--===== 3. Using N1
-- (3.1) To begin with numbers other than 0 or 1 use N1 (e.g. -3 to 3):
SELECT N1 FROM dbo.rangeAB(-3,3,1,1);
-- (3.2) ROW_NUMBER() is built in. If you want a ROW_NUMBER() include RN:
SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,1);
-- (3.3) If you wanted a ROW_NUMBER() that started at 0 you would do this:
SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,0);
--===== 4. Using N2 and #gap
-- (4.1) To get 0,10,20,30...100, set #low to 0, #high to 100 and #gap to 10:
SELECT N1 FROM dbo.rangeAB(0,100,10,1);
-- (4.2) Note that N2=N1+#gap; this allows you to create a sequence of ranges.
-- For example, to get (0,10),(10,20),(20,30).... (90,100):
SELECT N1, N2 FROM dbo.rangeAB(0,90,10,1);
-- (4.3) Remember that a rownumber is included and it can begin at 0 or 1:
SELECT RN, N1, N2 FROM dbo.rangeAB(0,90,10,1);
[Examples]:
--===== 1. Generating Sample data (using rangeAB to create "dummy rows")
-- The query below will generate 10,000 ids and random numbers between 50,000 and 500,000
SELECT
someId = r.rn,
someNumer = ABS(CHECKSUM(NEWID())%450000)+50001
FROM rangeAB(1,10000,1,1) r;
--===== 2. Create a series of dates; rn is 0 to include the first date in the series
DECLARE #startdate DATE = '20180101', #enddate DATE = '20180131';
SELECT r.rn, calDate = DATEADD(dd, r.rn, #startdate)
FROM dbo.rangeAB(1, DATEDIFF(dd,#startdate,#enddate),1,0) r;
GO
--===== 3. Splitting (tokenizing) a string with fixed sized items
-- given a delimited string of identifiers that are always 7 characters long
DECLARE #string VARCHAR(1000) = 'A601225,B435223,G008081,R678567';
SELECT
itemNumber = r.rn, -- item's ordinal position
itemIndex = r.n1, -- item's position in the string (it's CHARINDEX value)
item = SUBSTRING(#string, r.n1, 7) -- item (token)
FROM dbo.rangeAB(1, LEN(#string), 8,1) r;
GO
--===== 4. Splitting (tokenizing) a string with random delimiters
DECLARE #string VARCHAR(1000) = 'ABC123,999F,XX,9994443335';
SELECT
itemNumber = ROW_NUMBER() OVER (ORDER BY r.rn), -- item's ordinal position
itemIndex = r.n1+1, -- item's position in the string (it's CHARINDEX value)
item = SUBSTRING
(
#string,
r.n1+1,
ISNULL(NULLIF(CHARINDEX(',',#string,r.n1+1),0)-r.n1-1, 8000)
) -- item (token)
FROM dbo.rangeAB(0,DATALENGTH(#string),1,1) r
WHERE SUBSTRING(#string,r.n1,1) = ',' OR r.n1 = 0;
-- logic borrowed from: http://www.sqlservercentral.com/articles/Tally+Table/72993/
--===== 5. Grouping by a weekly intervals
-- 5.1. how to create a series of start/end dates between #startDate & #endDate
DECLARE #startDate DATE = '1/1/2015', #endDate DATE = '2/1/2015';
SELECT
WeekNbr = r.RN,
WeekStart = DATEADD(DAY,r.N1,#StartDate),
WeekEnd = DATEADD(DAY,r.N2-1,#StartDate)
FROM dbo.rangeAB(0,datediff(DAY,#StartDate,#EndDate),7,1) r;
GO
-- 5.2. LEFT JOIN to the weekly interval table
BEGIN
DECLARE #startDate datetime = '1/1/2015', #endDate datetime = '2/1/2015';
-- sample data
DECLARE #loans TABLE (loID INT, lockDate DATE);
INSERT #loans SELECT r.rn, DATEADD(dd, ABS(CHECKSUM(NEWID())%32), #startDate)
FROM dbo.rangeAB(1,50,1,1) r;
-- solution
SELECT
WeekNbr = r.RN,
WeekStart = dt.WeekStart,
WeekEnd = dt.WeekEnd,
total = COUNT(l.lockDate)
FROM dbo.rangeAB(0,datediff(DAY,#StartDate,#EndDate),7,1) r
CROSS APPLY (VALUES (
CAST(DATEADD(DAY,r.N1,#StartDate) AS DATE),
CAST(DATEADD(DAY,r.N2-1,#StartDate) AS DATE))) dt(WeekStart,WeekEnd)
LEFT JOIN #loans l ON l.lockDate BETWEEN dt.WeekStart AND dt.WeekEnd
GROUP BY r.RN, dt.WeekStart, dt.WeekEnd ;
END;
--===== 6. Identify the first vowel and last vowel in a along with their positions
DECLARE #string VARCHAR(200) = 'This string has vowels';
SELECT TOP(1) position = r.rn, letter = SUBSTRING(#string,r.rn,1)
FROM dbo.rangeAB(1,LEN(#string),1,1) r
WHERE SUBSTRING(#string,r.rn,1) LIKE '%[aeiou]%'
ORDER BY r.rn;
-- To avoid a sort in the execution plan we'll use op instead of rn
SELECT TOP(1) position = r.op, letter = SUBSTRING(#string,r.op,1)
FROM dbo.rangeAB(1,LEN(#string),1,1) r
WHERE SUBSTRING(#string,r.rn,1) LIKE '%[aeiou]%'
ORDER BY r.rn;
---------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20140518 - Initial Development - Alan Burstein
Rev 01 - 20151029 - Added 65 rows to make L1=465; 465^3=100.5M. Updated comment section
- Alan Burstein
Rev 02 - 20180613 - Complete re-design including opposite number column (op)
Rev 03 - 20180920 - Added additional CROSS JOIN to L2 for 530B rows max - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS
(
SELECT 1
FROM (VALUES
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0)) T(N) -- 90 values
),
L2(N) AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT
r.RN,
r.OP,
r.N1,
r.N2
FROM
(
SELECT
RN = 0,
OP = (@high-@low)/@gap,
N1 = @low,
N2 = @gap+@low
WHERE @row1 = 0
UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
RN = i.rn,
OP = (@high-@low)/@gap+(2*@row1)-i.rn,
N1 = (i.rn-@row1)*@gap+@low,
N2 = (i.rn-(@row1-1))*@gap+@low
FROM iTally AS i
ORDER BY rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;
GO
IF OBJECT_ID('dbo.NGrams8k', 'IF') IS NOT NULL DROP FUNCTION dbo.NGrams8k;
GO
CREATE FUNCTION dbo.NGrams8k
(
@string VARCHAR(8000), -- Input string
@N INT -- requested token size
)
/*****************************************************************************************
[Purpose]:
A character-level N-Grams function that outputs a contiguous stream of #N-sized tokens
based on an input string (#string). Accepts strings up to 8000 varchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+, Azure SQL Database
[Syntax]:
--===== Autonomous
SELECT ng.position, ng.token
FROM dbo.NGrams8k(@string,@N) AS ng;
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable AS s
CROSS APPLY dbo.NGrams8K(s.SomeValue,@N) AS ng;
[Parameters]:
@string = The input string to split into tokens.
@N = The size of each token returned.
[Returns]:
Position = BIGINT; the position of the token in the input string
token = VARCHAR(8000); a #N-sized character-level N-Gram token
[Dependencies]:
1. dbo.rangeAB (iTVF)
[Developer Notes]:
1. NGrams8k is not case sensitive;
2. Many functions that use NGrams8k will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When #N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either #string or #N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(#N > 0 AND #N <= DATALENGTH(#string)) OR (#N IS NULL OR #string IS NULL)
4. NGrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Split the string, "abcd" into unigrams, bigrams and trigrams
SELECT ng.position, ng.token FROM dbo.NGrams8k('abcd',1) AS ng; -- unigrams (#N=1)
SELECT ng.position, ng.token FROM dbo.NGrams8k('abcd',2) AS ng; -- bigrams (#N=2)
SELECT ng.position, ng.token FROM dbo.NGrams8k('abcd',3) AS ng; -- trigrams (#N=3)
--===== How many times the substring "AB" appears in each record
DECLARE #table TABLE(stringID int identity primary key, string varchar(100));
INSERT #table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM #table t
CROSS APPLY dbo.NGrams8k(t.string,2) AS ng
WHERE ng.token = 'AB'
GROUP BY string;
[Revision History]:
------------------------------------------------------------------------------------------
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 03 - 20150909 - Added logic to only return values if #N is greater than 0 and less
than the length of #string. Updated comment section. - Alan Burstein
Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the #string and #N
parameters to prevent a NULL string or NULL #N from causing "an
improper value" being passed to the TOP clause. - Alan Burstein
Rev 05 - 20171228 - Small simplification; changed:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(#string,''))-(ISNULL(#N,1)-1)),0)))
to:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(#string,''))+1-ISNULL(#N,1)),0)))
Rev 06 - 20180612 - Using CHECKSUM(N) in the to convert N in the token output instead of
using (CAST N as int). CHECKSUM removes the need to convert to int.
Rev 07 - 20180612 - re-designed to: (1) use dbo.rangeAB - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
position = r.RN,
token = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO
CREATE FUNCTION dbo.patExtract8K
(
@string VARCHAR(8000),
@pattern VARCHAR(50)
)
/*****************************************************************************************
[Description]:
This can be considered a T-SQL inline table valued function (iTVF) equivalent of
Microsoft's mdq.RegexExtract except:
1. It includes each matching substring's position in the string
2. It accepts varchar(8000) instead of nvarchar(4000) for the input string, varchar(50)
instead of nvarchar(4000) for the pattern
3. The mask parameter is not required and therefore does not exist.
4. You have specify what text we're searching for as an exclusion; e.g. for numeric
characters you should search for '[^0-9]' instead of '[0-9]'.
5. There is is no parameter for naming a "capture group". Using the variable below, both
the following queries will return the same result:
DECLARE #string nvarchar(4000) = N'123 Main Street';
SELECT item FROM dbo.patExtract8K(#string, '[^0-9]');
SELECT clr.RegexExtract(#string, N'(?<number>(\d+))(?<street>(.*))', N'number', 1);
Alternatively, you can think of patExtract8K as Chris Morris' PatternSplitCM (found here:
http://www.sqlservercentral.com/articles/String+Manipulation/94365/) but only returns the
rows where [matched]=0. The key benefit of is that it performs substantially better
because you are only returning the number of rows required instead of returning twice as
many rows then filtering out half of them. Furthermore, because we're
The following two sets of queries return the same result:
DECLARE #string varchar(100) = 'xx123xx555xx999';
BEGIN
-- QUERY #1
-- patExtract8K
SELECT ps.itemNumber, ps.item
FROM dbo.patExtract8K(#string, '[^0-9]') ps;
-- patternSplitCM
SELECT itemNumber = row_number() over (order by ps.itemNumber), ps.item
FROM dbo.patternSplitCM(#string, '[^0-9]') ps
WHERE [matched] = 0;
-- QUERY #2
SELECT ps.itemNumber, ps.item
FROM dbo.patExtract8K(#string, '[0-9]') ps;
SELECT itemNumber = row_number() over (order by itemNumber), item
FROM dbo.patternSplitCM(#string, '[0-9]')
WHERE [matched] = 0;
END;
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous
SELECT pe.ItemNumber, pe.ItemIndex, pe.ItemLength, pe.Item
FROM dbo.patExtract8K(@string,@pattern) pe;
--===== Against a table using APPLY
SELECT t.someString, pe.ItemIndex, pe.ItemLength, pe.Item
FROM dbo.SomeTable t
CROSS APPLY dbo.patExtract8K(t.someString, @pattern) pe;
[Parameters]:
@string = varchar(8000); the input string
@pattern = varchar(50); the pattern to search for
[Returns]:
itemNumber = bigint; the instance or ordinal position of the matched substring
itemIndex = bigint; the location of the matched substring inside the input string
itemLength = int; the length of the matched substring
item = varchar(8000); the returned text
[Developer Notes]:
1. Requires NGrams8k
2. patExtract8K does not return any rows on NULL or empty strings. Consider using
OUTER APPLY or append the function with the code below to force the function to return
a row on emply or NULL inputs:
UNION ALL SELECT 1, 0, NULL, #string WHERE nullif(#string,'') IS NULL;
3. patExtract8K is not case sensitive; use a case sensitive collation for
case-sensitive comparisons
4. patExtract8K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
5. patExtract8K performs substantially better with a parallel execution plan, often
2-3 times faster. For queries that leverage patextract8K that are not getting a
parallel exeution plan you should consider performance testing using Traceflag 8649
in Development environments and Adam Machanic's make_parallel in production.
[Examples]:
--===== (1) Basic extact all groups of numbers:
WITH temp(id, txt) as
(
SELECT * FROM (values
(1, 'hello 123 fff 1234567 and today;""o999999999 tester 44444444444444 done'),
(2, 'syat 123 ff tyui( 1234567 and today 999999999 tester 777777 done'),
(3, '&**OOOOO=+ + + // ==?76543// and today !!222222\\\tester{}))22222444 done'))t(x,xx)
)
SELECT
[temp.id] = t.id,
pe.itemNumber,
pe.itemIndex,
pe.itemLength,
pe.item
FROM temp AS t
CROSS APPLY dbo.patExtract8K(t.txt, '[^0-9]') AS pe;
-----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170801 - Initial Development - Alan Burstein
Rev 01 - 20180619 - Complete re-write - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT itemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
itemIndex = f.position,
itemLength = itemLen.l,
item = SUBSTRING(f.token, 1, itemLen.l)
FROM
(
SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
FROM dbo.NGrams8k(@string, 1) AS ng
WHERE PATINDEX(@pattern, ng.token) < --<< this token does NOT match the pattern
ABS(SIGN(ng.position-1)-1) + --<< are you the first row? OR
PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
) AS f(position, token)
CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l);
GO

How to Capture text from DB column in SQL Server 2012?

I have a notes column with a length of more than 80,000 characters.
As per the transformation rule, I have to write a SQL script which will capture the notes column in the below steps:
First 300 characters in Column_A
Next 300 characters in Column_B
Next 300 characters in Column_C
and so on.
So I am looking for an output as below:
For every client ID, repeating rows until the end of the notes column is reached.
Ouch! That's quite a complex requirement. You will need to combine a number of skills to solve this one.
Firstly you need to create additional rows. One way to achieve this is via recursion. In the example below I've calculated how many rows are required for each Client Id. I've then used recursion to create them.
You also need to break each row into three 300-character blocks. In my example I've used three 3-character blocks instead, so it's easier to read. But the principle will scale up. Using SUBSTRING and the record number you can calculate the starting point for each column.
I've created some sample records in a CTE called Raw. This allows anyone to follow the example, which is up on Stack Data Exchange (link below).
Example
DECLARE @ColumnWidth INT = 3; -- Use to adjust required length of columns A, B and C.
DECLARE @ColumnCount INT = 3; -- Use to adjust number of output columns.
WITH [Raw] AS
(
/* This CTE creates sample records for us to experiment with.
* The note column contains each letter of the alphabet, repeated
* 3 times. The repetition will help us validate the result set.
*
* Using ceiling, to round up, the field length (@ColumnWidth) and
* the number of fields (@ColumnCount) and the number of characters (LEN)
* we can calculate how many rows are required.
*/
SELECT
r.ClientId,
r.Note,
CEILING(CAST(LEN(r.Note) AS DECIMAL(18, 8)) / (@ColumnWidth * @ColumnCount)) AS RecordsRequired
FROM
(
VALUES
(1, 'aaabbbcccdddeeefffggghhhiiijjjkkklllmmmnnnooopppqqqrrrssstttuuuvvvwwwxxxyyyzz'),
(2, 'aaabbbcccdddeeefffggghhhiiijjjkkklll'),
(3, 'aaabbbcccdddeeefffggghhhiiijjjkkklllmmmnnno'),
(4, 'aaabbbcccdddeeefffggghhhiiijjjkkklllmmmnnnoooppp'),
(5, 'aaabbbcccdddeeefffggghhhiiijjj'),
(6, 'aaabbbcccdd')
) AS r(ClientId, Note)
),
MultiRow AS
(
/* This CTE uses recursion to return multiple rows for
* each original row.
* The number returned matches the RecordsRequired value
* from the Raw CTE.
*/
SELECT
1 AS RecordNumber,
RecordsRequired,
ClientId,
Note
FROM
[Raw]
UNION ALL
-- Keep repeating each record until the number of required rows has been returned.
SELECT
RecordNumber + 1 AS RecordNumber,
RecordsRequired,
ClientId,
Note
FROM
MultiRow
WHERE
RecordNumber < RecordsRequired
)
/* Each record returned by the MultiRow CTE is numbered: 1, 2, 3 etc.
* Using this we can extract blocks of text from the original Note column.
*/
SELECT
ClientId,
SUBSTRING(Note, ((@ColumnWidth * @ColumnCount) * RecordNumber) - ((@ColumnWidth * 3) -1), @ColumnWidth) AS Column_A,
SUBSTRING(Note, ((@ColumnWidth * @ColumnCount) * RecordNumber) - ((@ColumnWidth * 2) -1), @ColumnWidth) AS Column_B,
SUBSTRING(Note, ((@ColumnWidth * @ColumnCount) * RecordNumber) - ((@ColumnWidth * 1) -1), @ColumnWidth) AS Column_C
FROM
MultiRow
ORDER BY
ClientId, RecordNumber
;
Here is how you can do this:
DECLARE @c TABLE(ID INT, Notes VARCHAR(26))
INSERT INTO @c VALUES
(1, 'abcdefghijklmnopqrstuvwxyz'),
(2, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
DECLARE @size INT = 26
DECLARE @chunk INT = 5
;WITH tally AS(SELECT 1 s1, @chunk + 1 s2, 2*@chunk + 1 s3
UNION ALL
SELECT s3 + @chunk, s3 + 2*@chunk, s3 + 3*@chunk FROM tally
WHERE s3 < @size)
SELECT c.ID,
SUBSTRING(Notes, t.s1, @chunk) A,
SUBSTRING(Notes, t.s2, @chunk) B,
SUBSTRING(Notes, t.s3, @chunk) C
FROM @c c
CROSS JOIN tally t
ORDER BY c.ID, t.s1
Output:
ID A B C
1 abcde fghij klmno
1 pqrst uvwxy z
2 ABCDE FGHIJ KLMNO
2 PQRST UVWXY Z
Description:
The tally CTE returns the starting positions, which you then use in the SUBSTRING function. For the above configuration it returns:
s1 s2 s3
1 6 11
16 21 26
This uses a recursive CTE which spreads the starting positions across the rows, three starting positions per row. The rest should be easy to follow.
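If you would rather avoid recursion, the same starting positions can be produced from a numbers source such as master..spt_values (a sketch, not part of the answer above; plug the s1/s2/s3 values it returns into the same SUBSTRING query):
DECLARE @size INT = 26, @chunk INT = 5;
SELECT s1 = v.number * 3 * @chunk + 1,
       s2 = v.number * 3 * @chunk + @chunk + 1,
       s3 = v.number * 3 * @chunk + 2 * @chunk + 1
FROM master..spt_values v
WHERE v.type = 'P'
  AND v.number * 3 * @chunk < @size;
-- returns (1, 6, 11) and (16, 21, 26) for the configuration above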
The required result can be obtained using simple looping
/* Declare a temporary table for storing the results */
DECLARE @Result_TABLE AS TABLE
(
CustomerId BIGINT
,ColA VARCHAR(300)
,ColB VARCHAR(300)
,ColC VARCHAR(300)
)
DECLARE @CustomerCount INT --To store customer count
DECLARE @IteratorForCustomers INT = 1 --To iterate for each customer
/* Get count of customers */
SELECT @CustomerCount = COUNT (1) FROM Customers
DECLARE @CustomerId BIGINT --To store customer id in looping
DECLARE @TempNote VARCHAR(MAX) -- To store customer note of each customer in looping
/* Loop for all customers */
WHILE (@IteratorForCustomers <= @CustomerCount)
BEGIN
;WITH CTE AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY CustomerID ) AS RowId
,CustomerId
,Customer_Note
FROM Customers
)
SELECT
@CustomerId = a.CustomerId
,@TempNote = a.Customer_Note
FROM CTE a
WHERE
RowId = @IteratorForCustomers
/* Loop for generating each row with three columns */
WHILE (LEN(@TempNote)>0)
BEGIN
INSERT INTO @Result_TABLE
VALUES
(
@CustomerId,SUBSTRING(@TempNote,1,300),SUBSTRING(@TempNote,301,300),SUBSTRING(@TempNote,601,300)
)
SET @TempNote = CASE WHEN LEN(@TempNote)>900 THEN SUBSTRING(@TempNote,901,LEN(@TempNote)-900)
ELSE NULL END
END
SET @IteratorForCustomers = @IteratorForCustomers + 1
END
SELECT * FROM @Result_TABLE

sql query to split a single column value into multiple rows

I want to create a sql query to split a single column value into multiple rows like:
SELECT ID, PRODUCT_COUNT FROM MERCHANT WHERE ID = 3050
ID PRODUCT_COUNT
----------- -------------
3050 591
Based on this result, I want 6 rows as follows:
ID RANGE
3050 0-100
3050 101-200
3050 201-300
3050 301-400
3050 401-500
3050 501-591
How can I achieve this in a query?
WITH cte AS (
SELECT
m.ID,
PRODUCT_COUNT,
LoBound = (v.number - 1) * 100 + 1,
HiBound = v.number * 100
FROM MERCHANT m
INNER JOIN master..spt_values v
ON v.type = 'P' AND v.number BETWEEN 1 AND (m.PRODUCT_COUNT - 1) / 100 + 1
WHERE m.ID = 3050
)
SELECT
ID,
RANGE = CAST(CASE LoBound
WHEN 1 THEN 0
ELSE LoBound
END AS varchar)
+ '-'
+ CAST(CASE
WHEN HiBound < PRODUCT_COUNT THEN HiBound
ELSE PRODUCT_COUNT
END AS varchar)
FROM cte
The first CASE makes sure the first range starts with 0, not with 1, same as in your sample output.
Sorry... code removed. I made a mistake where if the Product_Count was evenly divisible by 100, it gave an incorrect final row.
UPDATE:
Andriy's code is still correct. I was missing a "-1" in mine. I've repaired that and reposted both the test setup and my alternative solution.
Both Andriy's and my code produce the output in the correct order for this experiment, but I added an ORDER BY to guarantee it.
Here's the code for the test setup...
--===== Conditionally drop and create a test table for
-- everyone to work against.
IF OBJECT_ID('tempdb..#Merchant','U') IS NOT NULL
DROP TABLE #Merchant
;
SELECT TOP 10000
ID = IDENTITY(INT,1,1),
Product_Count = ABS(CHECKSUM(NEWID()))%100000
INTO #Merchant
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
ALTER TABLE #Merchant
ADD PRIMARY KEY CLUSTERED (ID)
;
--===== Make several entries where there's a known test setup.
UPDATE #Merchant
SET Product_Count = CASE
WHEN ID = 1 THEN 0
WHEN ID = 2 THEN 1
WHEN ID = 3 THEN 99
WHEN ID = 4 THEN 100
WHEN ID = 5 THEN 101
WHEN ID = 6 THEN 99999
WHEN ID = 7 THEN 100000
WHEN ID = 8 THEN 100001
END
WHERE ID <= 8
;
Here's the alternative I posted before with the -1 correction.
WITH
cteCreateRanges AS
(--==== This determines what the ranges are
SELECT m.ID,
RangeStart = t.Number*100+SIGN(t.Number),
RangeEnd =(t.Number+1)*100,
Product_Count
FROM master.dbo.spt_Values t
CROSS JOIN #Merchant m
WHERE t.Number BETWEEN 0 AND (m.Product_Count-1)/100
AND t.Type = 'P'
AND m.ID BETWEEN 1 AND 8 -- = @FindID -<<<---<<< Or use a single variable to find.
)--==== This makes the output "pretty" and sorts in correct order
SELECT ID,
[Range] = CAST(RangeStart AS VARCHAR(10)) + '-'
+ CASE
WHEN RangeEnd <= Product_Count
THEN CAST(RangeEnd AS VARCHAR(10))
ELSE CAST(Product_Count AS VARCHAR(10))
END
FROM cteCreateRanges
ORDER BY ID, RangeStart
;
Sorry about the earlier mistake. Thanks, Andriy, for catching it.
You could create a table like this (I am changing the first range to include 100 elements like the others to make it easier, and basing it at one, so that the indexes will match the total count):
CountRangeBoundary
MinIndexInRange
---------------
1
101
201
301
401
501
601
...
Then do a θ-join like this:
SELECT m.ID,
       crb.MinIndexInRange AS RANGE_MIN,
       -- T-SQL has no two-argument MIN(), so cap the range end with CASE
       -- (on SQL Server 2022+ you could use LEAST); 99 = range width - 1
       CASE WHEN crb.MinIndexInRange + 99 < m.PRODUCT_COUNT
            THEN crb.MinIndexInRange + 99
            ELSE m.PRODUCT_COUNT END AS RANGE_MAX
FROM MERCHANT m
JOIN CountRangeBoundary crb ON crb.MinIndexInRange <= m.PRODUCT_COUNT
WHERE m.ID = 3050
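The boundary table itself can be filled from any numbers source, for example (a sketch; the upper limit of 2001 is arbitrary):
CREATE TABLE CountRangeBoundary (MinIndexInRange INT NOT NULL PRIMARY KEY);
INSERT INTO CountRangeBoundary (MinIndexInRange)
SELECT v.number * 100 + 1            -- 1, 101, 201, ... 2001
FROM master..spt_values v
WHERE v.type = 'P' AND v.number BETWEEN 0 AND 20;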
It looks like those ranges are a piece of data, so they should really be in a table (even if you don't expect them to change, because they will). That has the nice side benefit of making this task trivial:
CREATE TABLE My_Ranges ( -- Use a more descriptive name
range_start SMALLINT NOT NULL,
range_end SMALLINT NOT NULL,
CONSTRAINT PK_My_Ranges PRIMARY KEY CLUSTERED (range_start)
)
GO
SELECT
P.id,
R.range_start,
CASE
WHEN R.range_end < P.product_count THEN R.range_end
ELSE P.product_count
END AS range_end
FROM
Products P
INNER JOIN My_Ranges R ON
R.range_start <= P.product_count
If your ranges will always be contiguous then you can omit the range_end column. Your query will become a little more complex, but you won't have to worry about ranges overlapping or gaps in your ranges.
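For completeness, populating My_Ranges to match the sample output in the question might look like this (the values are an assumption based on that output):
INSERT INTO My_Ranges (range_start, range_end)
VALUES (0, 100), (101, 200), (201, 300), (301, 400), (401, 500), (501, 600);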
You can try a recursive CTE.
WITH CTE AS
(
SELECT Id, 0 MinB, 100 MaxB, [Range]
FROM YourTable
UNION ALL
SELECT Id, CASE WHEN MinB = 0 THEN MinB+101 ELSE MinB+100 END, MaxB + 100, [Range]
FROM CTE
WHERE MinB < [Range]
)
SELECT Id,
CAST(MinB AS VARCHAR) + ' - ' + CAST(CASE WHEN MaxB>[Range] THEN [Range] ELSE MaxB END AS VARCHAR) [Range]
FROM CTE
WHERE MinB < [Range]
ORDER BY Id, [Range]
OPTION(MAXRECURSION 5000)
I put a limit on the recursion level of 5000, but you can change it (or leave it at zero, which basically means keep doing recursion for as long as it can).
